<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1050">
  <Title>Bidirectional Decoding for Statistical Machine Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Statistical Machine Translation
</SectionTitle>
    <Paragraph position="0"> Statistical machine translation regards machine translation as a process of translating a source lan-</Paragraph>
    <Paragraph position="2"> guage text (f) into a target language text (e) with the following formula: e=arg maxe P(ejf) The Bayes Rule is applied to the above to derive: e=arg maxe P(fje)P(e) The translation process is treated as a noisy channel model, like those used in speech recognition in which there exists e transcribed as f, and a translation is to infer the best e from f in terms of P(fje)P(e). The former term, P(fje), is a translation model representing some correspondence between bilingual text. The latter, P(e), is the language model denoting the likelihood of the channel source text. In addition, a word correspondence model, called alignment a, is introduced to the translation model to represent a positional correspondence of the channel target and source words:</Paragraph>
    <Paragraph position="4"> An example of an alignment is shown in Figure 1, where the English sentence could you recommend another hotel is mapped onto the Japanese hoka no hoteru o shokaishi teitadake masu ka , and both hoka and no are aligned to another , etc. The NULL symbol at index 0 is also a lexical entry in which no morpheme is aligned from the channel target morpheme, such as masu and ka in this Japanese example.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.1 IBM Model 4
</SectionTitle>
    <Paragraph position="0"> The IBM Model 4, main focus in this paper, is composed of the following models (see Figure 2): Lexical Model t( fje) : Word-for-word translation model, representing the probability of a source word f being translated into a target word e.</Paragraph>
    <Paragraph position="1"> Fertility Model n( je) : Representing the probability of a source word e generating words.</Paragraph>
    <Paragraph position="2"> Distortion Model d : The probability of distortion. In Model 4, the model is decomposed into two sets of parameters: d1( j c ijA(ei);B( f j)) : Distortion probability for head words. The head word is the rst of the target words generated from a source word a cept, that is the channel source word with fertility more than and equal to one. The head word position j is determined by the word classes of the previous source word, A(ei), and target word,B( f j), relative to the centroid of the previous source word, c i.</Paragraph>
    <Paragraph position="3"> d&gt;1( j j0jB( f j)) : Distortion probability for non-head words. The position of a non-head word j is determined by the word class and relative to the previous target word generated from the cept ( j0).</Paragraph>
    <Paragraph position="4"> NULL Translation Model p1 : A xed probability of inserting a NULL word after determining each target word f .</Paragraph>
    <Paragraph position="5"> For details, refer to Brown et al. (1993).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Search Problem
</SectionTitle>
      <Paragraph position="0"> The search problem of statistical machine translation is to induce the maximum likely channel source sequence, e, given f and the model, P(fje) =P a P(f;aje) and P(e). For the space of a is extremely large,jajl+1, where the l is the output length, an approximation of P(fje)'P(f;aje) is used when exploring the possible candidates of translation.</Paragraph>
      <Paragraph position="1"> This problem is known to be NP-Complete (Knight, 1999), for the re-ordering property in the model further complicates the search. One of the solution is the left-to-right generation of output by consuming input words in any-order. Under this constraint, many researchers had contributed algorithms and associated pruning strategies, such as Berger et al. (1996), Och et al. (2001), Wang and Waibel (1997), Tillmann and Ney (2000) Garcia-Varea and Casacuberta (2001) and Germann et al.</Paragraph>
      <Paragraph position="2"> (2001), though they all based on almost linearly  aligned language pairs, and not suitable for language pairs with totally di erent alignment correspondence, such as Japanese and English.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Decoding Algorithms
</SectionTitle>
    <Paragraph position="0"> The decoding methods presented in this paper explore the partial candidate translation hypotheses greedily, as presented in Tillmann and Ney (2000) and Och et al. (2001), and operation applied to each hypothesis is similar to those explained in Berger et al. (1996), Och et al. (2001) and Germann et al. (2001). The algorithm is depicted in Algorithm 1 where C = fjk : k = 1:::jCjg represents a set of input string position 1. The algorithm assumes two kinds of partial hypotheses2, translated partially from an input string, one is an open hypothesis that can be extended by raising the fertility. The other is a close hypothesis that is to be extended by inserting a string e0 to the hypothesis. The e0 is a sequence of output word, consisting of a word with the fertility more than one (translation of f j) and other words with zero fertility. The translation of f j can be computed either by inverse translation table (Och et al., 2001; Al-Onaizan et al., 1999). The list of zero fertility words can be obtained from the viterbi alignment of training corpus (Germann et al., 2001).</Paragraph>
    <Paragraph position="1"> The extension operator applied to an open hypothesis (e;C) is: align j to ei this creates a new hypothesis by raising the fertility of ei by consuming the input word f j. The generated hypothesis can be treated as either closed or open, that means to stop raising the fertility or raise the fertility further more.</Paragraph>
    <Paragraph position="2"> The operators applied to a close hypothesis are:  translation.</Paragraph>
    <Paragraph position="3"> Algorithm 1 Beam Decoding Search input source string: f1 f2:::fm  for all cardinality c=0;1;:::m 1 do for all (e;C) wherejCj=c do for all j=1;:::m and j &lt; C do if (e;C) is open then align j to ei and keep it open align j to ei and close it else align j to NULL insert e0, align from j and open it insert e0, align from j and close it</Paragraph>
    <Paragraph position="5"> align j to NULL raise the fertility for the NULL word.</Paragraph>
    <Paragraph position="6"> insert e0, align from j this operator insert a string e0 and align one input word f j to one of the word in e0. After this operation, the new hypothesis can be regarded as either open or closed.</Paragraph>
    <Paragraph position="7"> Pruning is inevitable in the process of decoding, and applied is the beam search pruning, in which the maximum number of hypotheses to be considered is limited. In addition, fertility pruning is also introduced which suppress the word with large number of fertility. The skipping based criteria, such as introduced by Och et al. (2001), is not appropriate for the language pairs with drastically di erent alignment, such as Japanese and English, hence was not considered in this paper. Depending on the output generation direction, the algorithm can generate either in left-to-right or right-to-left, by alternating some constraints of insertion of output words.</Paragraph>
    <Paragraph position="9"> the partial output string, e, and the last word in e0 was aligned from f j.</Paragraph>
    <Paragraph position="11"> the partial output string, e, and the rst word in e0 was aligned from f j.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Left-to-Right Decoding
</SectionTitle>
      <Paragraph position="0"> The left-to-right decoding enforces the restriction where the insertion of e0 is allowed after the partially generated e, and alignment from the input word f j is restricted to the end of the word of e0.</Paragraph>
      <Paragraph position="1"> Hence, the operator applied to an open hypothesis raise the fertility for the word at the end of e (refer to Figure 3).</Paragraph>
      <Paragraph position="2"> The language which place emphasis around the beginning of a sentence, such as English, will be suitable in this direction, for the Language Model score P(e) can estimate what should come rst.</Paragraph>
      <Paragraph position="3"> Hence, the decoder can discriminate a hypothesis better or not.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Right-to-Left Decoding
</SectionTitle>
      <Paragraph position="0"> The right-to-left decoding does the reverse of the left-to-right decoding, in which the insertion of e0 is allowed only before the e and the f j is aligned to the beginning of the word of e0 (see Figure 4).</Paragraph>
      <Paragraph position="1"> Therefore, the open hypothesis is extended by raising the fertility of the beginning of the word of e. In prepending a string to a partial hypothesis, an alignment vector should be reassigned so that the values can point out correct index.</Paragraph>
      <Paragraph position="2"> Again, the right-to-left direction is suitable for the language which enforces stronger constraints at the end of sentence, such as Japanese, similar to the reason mentioned above.</Paragraph>
      <Paragraph position="3"> ef 1 ::: ei ::: eblbe</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Bidirectional Decoding
</SectionTitle>
      <Paragraph position="0"> The bidirectional decoding decode the input words in both direction, one with left-to-right decoding method up to the cardinality ofdm=2eand right-to-left direction up to the cardinality of bm=2c, where m is the input length. Then, the two hypotheses are merged when both are open and can share the same output word e, which resulted in raising the fertility of e. If both of them are closed hypotheses, then an additional sequence of zero fertility words (or NULL sequence) are inserted (refer to Figure 5).</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Computational Complexity
</SectionTitle>
      <Paragraph position="0"> The computational complexity for the left-to-right and right-to-left is the same, O(jEj3m22m), as reported by Tillmann and Ney (2000), in which jEj is the size of the vocabulary for output sentences 3.</Paragraph>
      <Paragraph position="1"> The bidirectional method involves merging of two hypotheses, hence additional O(</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Effects of Decoding Direction
</SectionTitle>
      <Paragraph position="0"> The decoding algorithm generating in left-to-right direction lls the output sequence from the beginning of a sentence by consuming the input words in any order and by selecting the corresponding translation. null Therefore, the languages with pre x structure, such as English, German or French, can take the bene ts of this direction, because the language model/translation model can di erentiate good hypotheses to bad hypotheses around the beginning of the output sentences. Therefore, the narrowing the search space by the beam search crite3The termjEj3 is the case for trigram language model. ria (pruning) would not a ect the overall quality.</Paragraph>
      <Paragraph position="1"> On the other hand, if right-to-left decoding method were applied to such a language above, the difference of good hypotheses and bad hypotheses is small, hence the drop of hypotheses would a ect the quality of translation.</Paragraph>
      <Paragraph position="2"> The similar statement can hold for post x languages, such as Japanese, where emphasis is placed around the end of a sentence. For such languages, right-to-left decoding will be suitable but left-to-right decoding will degrade the quality of translation. null The bidirectional decoding is expected to take the bene ts of both of the directions, and will show the best results in any kind of languages.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> The corpus for this experiment consists of 172,481 bilingual sentences of English and Japanese extracted from a large-scale travel conversation corpus (Takezawa et al., 2002). The statistics of the corpus are shown in Table 1. The database was split into three parts: a training set of 152,183 sentence pairs, a validation set of 10,148, and a test set of 10,150.</Paragraph>
    <Paragraph position="1"> The translation models, both for the Japanese-to-English (J-E) and English-to-Japanese (E-J) translation, were trained toward IBM Model 4 on the training set and cross-validated on validation set to terminate the iteration by observing perplexity. In modeling IBM Model 4, POSs were used as word classes.</Paragraph>
    <Paragraph position="2"> From the viterbi alignments of the training corpus, A list of possible insertion of zero fertility words were extracted with frequency more than 10, around 1,300 sequences of words for both of the J-E and E-J translations. The test set consists of 150 Japanese sentences varying by the sentence length of 6, 8 and 10. The translation was carried out by three decoding methods:left-to-right, right-to-left and bidirectional one.</Paragraph>
    <Paragraph position="3"> The translation results were evaluated by worderror-rate (WER) and position independent worderror-rate (PER) (Watanabe et al., 2002; Och et al., 2001). The WER is the measure by penalizing insertion/deletion/replacement by 1. The PER is the one similar to WER but ignores the positions, allowing the reordered outputs, hence can estimate the accuracy for the tranlslation word selection. It has been also evaluated by subjective evaluation (SE) with the criteria ranging from A(perfect) to D(non null evaluated with WER, PER and SE. Table 3 shows the ratio of producing search errors, computed by comparing the translation model and lnguage model scores for the outputs from three decoding methods.</Paragraph>
    <Paragraph position="4"> Sample Japanese-to-English translations performed by the decoders is presented in Figure 6.</Paragraph>
  </Section>
class="xml-element"></Paper>