File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-1091_metho.xml
Size: 23,636 bytes
Last Modified: 2025-10-06 14:10:22
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1091"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics A Discriminative Global Training Algorithm for Statistical MT</Title> <Section position="4" start_page="721" end_page="722" type="metho"> <SectionTitle> 2 Block Sequence Model </SectionTitle> <Paragraph position="0"> This paper views phrase-based SMT as a block sequence generation process. Blocks are phrase pairs consisting of target and source phrases and local phrase re-ordering is handled by including so-called block orientation. Starting point for the block-based translation model is a block set, e.g.</Paragraph> <Paragraph position="1"> about a91a89a92a93a88 million Arabic-English phrase pairs for the experiments in this paper. This block set is used to decode training sentence to obtain block orientation sequences that are used in the discriminative parameter training. Nothing but the block set and the parallel training data is used to carry out the training. We use the block set described in (Al-Onaizan et al., 2004), the use of a different block set may effect translation results.</Paragraph> <Paragraph position="2"> Rather than predicting local block neighbors as in (Tillmann and Zhang, 2005) , here the model parameters are trained in a global setting. Starting with a simple model, the training data is decoded multiple times: the weight vector is trained to discriminate block sequences with a high translation score against block sequences with a high BLEU score 2. The high BLEU scoring block sequences are obtained as follows: the regular phrase-based decoder is modified in a way that it uses the BLEU score as optimization criterion (independent of any translation model). Here, searching for the highest BLEU scoring block sequence is restricted to local re-ordering as is the model-based decoding (as shown in Fig. 1). The BLEU score is computed with respect to the single reference translation provided by the parallel training data. A block sequence with an average BLEU score of about a94a89a92a93a88a63a95 is obtained for each training sentence 3. The 'true' maximum ing sentence pair separately (treating each sentence pair as a single-sentence corpus with a single reference) and then averaged over all training sentences. Although block sequences are found with a high BLEU score on average there is no guarantee to find the maximum BLEU block sequence for a given sentence pair. The target word sequence corresponding to a block sequence does not have to match the reference translation, i.e. maximum BLEU scores are quite low for some training sentences.</Paragraph> <Paragraph position="3"> blocka96 sequences are represented by high dimensional feature vectors using the binary features defined below and the translation process is handled as a multi-class classification problem in which each block sequence represents a possible class.</Paragraph> <Paragraph position="4"> The effect of this training procedure can be seen in Figure 2: each decoding step on the training data adds a high-scoring block sequence to the discriminative training and decoding performance on the training data is improved after each iteration (along with the test data decoding performance).</Paragraph> <Paragraph position="5"> A theoretical justification for the novel training procedure is given in Section 3.</Paragraph> <Paragraph position="6"> We now define the feature components for the block bigram feature vector a97a30a7 a0 a67 a12 a1 a67 a12a76a0 a67a69a77 a11 a14 in Eq. 1. 
Although the training algorithm can handle real-valued features as used in (Och, 2003; Tillmann and Zhang, 2005), the current paper intentionally excludes them. The current feature functions are similar to those used in common phrase-based translation systems, for which it has been shown that good translation performance can be achieved. [Footnote: For comparison, previously published systems that use similar feature types, e.g. (Ittycheriah and Roukos, 2005), report competitive BLEU scores on this task.] A systematic analysis of the novel training algorithm will allow us to include much more sophisticated features in future experiments, e.g. POS-based features and syntactic or hierarchical features (Chiang, 2005). The dimensionality of the feature vector f(b_i, o_i, b_{i-1}) depends on the number of binary features. For illustration purposes, the binary features are chosen such that they yield 1 on the example block sequence in Fig. 1.</Paragraph> <Paragraph position="7"> There are phrase-based and word-based features. The simplest phrase-based feature is a unigram block feature capturing the identity of a block; additional phrase-based features include block orientation as well as target and source phrase bigram features. Word-based features are used as well, e.g. features capturing word-to-word translation dependencies, similar to the use of Model 1 probabilities in (Koehn et al., 2003). Additionally, we use distortion features involving relative source word position and m-gram features for adjacent target words. These features correspond to the use of a language model, but the weights for these features are trained on the parallel training data only. For the most complex model, the number of features is about 35 million (ignoring all features that occur only once).</Paragraph> </Section> <Section position="5" start_page="722" end_page="724" type="metho"> <SectionTitle> 3 Approximate Relevant Set Method </SectionTitle> <Paragraph position="0"> Throughout this section, we let z = (b_1^n, o_1^n). Each block sequence z corresponds to a candidate translation. In the training data, where target translations are given, a BLEU score Bl(z) can be calculated for each z against the target translations. In this setup, our goal is to find a weight vector w such that the higher the translation score s_w(z) is, the higher the corresponding BLEU score Bl(z) should be. If we can find such a weight vector, then block decoding by searching for the highest s_w(z) will lead to good translations with a high BLEU score.</Paragraph> <Paragraph position="1"> Formally, we denote a source sentence by S, and let V(S) be the set of candidate oriented block sequences z = (b_1^n, o_1^n) that the decoder can generate from S. For example, in a monotone decoder, the set V(S) contains block sequences b_1^n that cover the source sentence S in source order. For a decoder with local re-ordering, the candidate set V(S) also includes additional block sequences with re-ordered block configurations that the decoder can efficiently search. Therefore, depending on the specific implementation of the decoder, the set V(S) can be different.
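Before continuing with the candidate set V(S), here is a small illustration of the binary block-bigram features just listed (block identity, orientation, phrase bigrams, and Model 1 style word pairs). The feature naming scheme is invented; this is only a sketch compatible with the scoring function shown earlier, not the system's actual feature extractor.

```python
def block_bigram_features(block, prev):
    """Sparse binary features for a block bigram (prev, block)."""
    feats = [
        ("block_id", block.src, block.tgt),   # unigram block identity
        ("orient", block.orient),             # block orientation
    ]
    if prev is not None:
        feats.append(("tgt_bigram", prev.tgt, block.tgt))  # target phrase bigram
        feats.append(("src_bigram", prev.src, block.src))  # source phrase bigram
    # Word-based features: word-to-word dependencies (Model 1 style).
    for s_word in block.src.split():
        for t_word in block.tgt.split():
            feats.append(("word_pair", s_word, t_word))
    return feats
```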
In general, V(S) is a subset of all possible oriented block sequences (b_1^n, o_1^n) that are consistent with the input sentence S.</Paragraph> <Paragraph position="2"> Given a scoring function s_w(.) and an input sentence S, we can assume that the decoder implements the following decoding rule: z^ = argmax_{z in V(S)} s_w(z) (2).</Paragraph> <Paragraph position="4"> Let S_1, ..., S_N be a set of N training sentences.</Paragraph> <Paragraph position="5"> Each sentence S_i is associated with a set V(S_i) of possible translation block sequences that are searchable by the decoder. Each translation block sequence z in V(S_i) induces a translation, which is then assigned a BLEU score Bl(z) (obtained by comparing against the target translations). The goal of the training is to find a weight vector w such that for each training sentence S_i, the corresponding decoder output z^ in V(S_i) has the maximum BLEU score among all z in V(S_i) based on Eq. 2. In other words, if z^ maximizes the scoring function s_w(z), then z^ should also maximize the BLEU metric.</Paragraph> <Paragraph position="6"> Based on this description, a simple idea is to learn the BLEU score Bl(z) for each candidate block sequence z. That is, we would like to estimate w such that s_w(z) is approximately Bl(z). This can be achieved through least squares regression. It is easy to see that if we can find a weight vector w that approximates Bl(z), then the decoding rule in Eq. 2 automatically maximizes the BLEU score.</Paragraph> <Paragraph position="7"> However, it is usually difficult to estimate Bl(z) reliably based only on a linear combination of the feature vector as in Eq. 1. We note that a good decoder does not necessarily employ a scoring function that approximates the BLEU score. Instead, we only need to make sure that the top-ranked block sequence obtained by the decoder scoring function has a high BLEU score. To formulate this idea, we attempt to find a decoding parameter such that, for each sentence S in the training data, sequences in V(S) with the highest BLEU scores get s_w(z) scores higher than those with low BLEU scores.</Paragraph> <Paragraph position="8"> Denote by V_K(S) the set of K block sequences in V(S) with the highest BLEU scores. Our decoded result should lie in this set; we call them the "truth". We shall refer to the remaining sequences V(S) - V_K(S) as the "alternatives". We look for a weight vector w that minimizes the regularized average loss over all training sentences given in Eq. 3, where Φ is a non-negative real-valued loss function (whose specific choice is not critical for the purposes of this paper) and λ >= 0 is a regularization parameter. In our experiments, results are obtained using the convex loss of Eq. 4, which compares the BLEU scores c, c' and the translation scores s, s' of a truth/alternative pair, with (x)_- = min(0, x). We refer to this formulation as the 'costMargin' (cost-sensitive margin) method: for each training sentence S, the cost-weighted margin between the 'true' block sequence set V_K(S) and the 'alternative' block sequence set V(S) - V_K(S) is maximized.
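Eq. 4 itself is not reproduced in this extract. The sketch below implements one convex, cost-sensitive margin loss with the properties described in the surrounding text (it upper-bounds the cost of a ranking error and is small when c and c' are close); it is an assumed stand-in for illustration, not the paper's exact formula.

```python
def cost_margin_loss(c_true, c_alt, s_true, s_alt):
    """Cost-weighted hinge loss for one truth/alternative pair.

    c_true >= c_alt are BLEU scores; s_true and s_alt are the model
    scores s_w(z) of the truth and the alternative.  The loss vanishes
    when the truth outscores the alternative by a margin, and is
    otherwise proportional to the BLEU difference (the "cost").
    """
    cost = c_true - c_alt      # how costly a ranking error would be
    margin = s_true - s_alt    # model separation of truth vs. alternative
    return cost * max(0.0, 1.0 - margin)
```

A subgradient step on such a loss pushes w toward the truth's feature vector and away from the alternative's, scaled by the cost, which is the flavor of the costMargin SGD update discussed later in this section.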
Note that due to the truth and alternative set-up, we always have c >= c'. This loss function gives an upper bound on the error we will suffer if the order of s and s' is wrongly predicted (that is, if we predict s <= s' although c > c'). It also has the property that if the BLEU scores satisfy c approximately equal to c', then the loss value is small (proportional to c - c').</Paragraph> <Paragraph position="17"> A major contribution of this work is a procedure to solve Eq. 3 approximately. The main difficulty is that the search space V(S) covered by the decoder can be extremely large: it cannot be enumerated for practical purposes. Our idea is to replace this large space by a small subspace of relevant points, the relevant set V^rel(S). A lemma (whose formal statement we omit) guarantees that, for a suitable loss function (as in our choice), the global optimal solution remains the same if the whole decoding space V is replaced by the relevant set V^rel. Each subspace V^rel(S_i) will be significantly smaller than V(S_i), because it only includes those alternatives z' with score s_w(z') close to one of the selected truths. These are the most important alternatives, which are easily confused with the truth. Essentially, the lemma says that if the decoder works well on these difficult alternatives (relevant points), then it works well on the whole space. The idea is closely related to active learning in standard classification problems, where we selectively pick the most important samples (often based on estimation uncertainty) for labeling in order to maximize classification performance (Lewis and Catlett, 1994). In the active learning setting, as long as we do well on the actively selected samples, we do well on the whole sample space. In our case, as long as we do well on the relevant set, the decoder will perform well.</Paragraph> <Paragraph position="20"> Since the relevant set depends on the decoder parameter w, and the decoder parameter is optimized on the relevant set, it is necessary to estimate them jointly using an iterative algorithm. The basic idea is to start with a decoding parameter, decode the training data to update the relevant set, re-estimate the parameter based on the relevant set, and iterate this process. The procedure is outlined in Table 1. [Table 1: Generic Approximate Relevant Set Method; step (*) updates the relevant sets, step (**) optimizes the weight vector by solving Eq. 5 approximately.] We intentionally leave the implementation details of the (*) step and the (**) step open. Moreover, in this general algorithm, we do not have to assume that s_w(z) has the form of Eq. 1. A natural question concerning the procedure is its convergence behavior. It can be shown that under mild assumptions, if in step (*) we pick, for each z_k in V_K(S), an alternative z~_k in V(S) - V_K(S) according to Eq. 6, then the procedure converges to the solution of Eq. 3. Moreover, the rate of convergence depends only on the properties of the loss function, and not on the size of V(S). This property is critical, as it shows that as long as Eq. 6 can be computed efficiently, the Approximate Relevant Set algorithm is efficient. Moreover, it gives a bound on the size of an approximate relevant set with a certain accuracy. [Footnote: Due to the space limitation, we will not include a formal statement here. A detailed theoretical investigation of the method will be given in a journal paper.]
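The procedure of Table 1 can be pictured schematically as follows, treating the decoder and the (**) optimization as black boxes. The function names, and the "add the decoder output" form of step (*) (the approximation actually adopted in Section 4.1 rather than the exact Eq. 6 selection), are illustrative assumptions.

```python
def approximate_relevant_set_training(sentences, truth_sets, decode,
                                      optimize_weights, iterations=30):
    """Sketch of the Approximate Relevant Set method (cf. Table 1).

    sentences        -- training source sentences S_1 .. S_N
    truth_sets       -- V_K(S_i): the K best-BLEU block sequences per sentence
    decode           -- decode(S, w) -> highest-scoring block sequence under w
    optimize_weights -- solves Eq. 5 on the current relevant sets, e.g. by SGD
    """
    w = {}                                          # initial weight vector
    relevant = [list(truth) for truth in truth_sets]
    for _ in range(iterations):
        for i, sentence in enumerate(sentences):
            z_hat = decode(sentence, w)             # step (*): grow relevant set
            relevant[i].append(z_hat)               #   with the decoder output
        w = optimize_weights(relevant, truth_sets)  # step (**)
    return w
```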
The approximate solution of Eq. 5 in step (**) can be implemented using stochastic gradient descent (SGD), where we may simply update the weight vector w, for a single sampled truth/alternative pair, by taking a step of size η along the negative gradient of the loss. The parameter η > 0 is a fixed constant often referred to as the learning rate. Again, convergence results can be proved for this procedure; due to the space limitation, we skip the formal statement as well as the corresponding analysis.</Paragraph> <Paragraph position="25"> Up to this point, we have not assumed any specific form of the decoder scoring function in our algorithm. Now consider Eq. 1 used in our model: the score of a block sequence is linear in w, namely the inner product of w with the sum of the block bigram feature vectors. With this feature representation and the loss function in Eq. 4, we obtain a costMargin SGD update rule that is applied for each training data point and each truth/alternative pair.</Paragraph> </Section> <Section position="6" start_page="724" end_page="726" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> We applied the novel discriminative training approach to a standard Arabic-to-English translation task. The training data comes from UN news sources. Some punctuation tokenization and some number classing are carried out on the English and the Arabic training data. We show translation results in terms of the automatic BLEU evaluation metric (Papineni et al., 2002) on the MT03 Arabic-English DARPA evaluation test set, consisting of 663 sentences with 16,278 Arabic words and 4 reference translations. In order to speed up the parameter training, the original training data is filtered according to the test set: all the Arabic substrings that occur in the test set are computed, and the parallel training data is filtered to include only those training sentence pairs that contain at least one of these phrases. The resulting pre-filtered training data contains about 230 thousand sentence pairs (5.52 million Arabic words and 6.76 million English words). The block set is generated using a phrase-pair selection algorithm similar to (Koehn et al., 2003; Al-Onaizan et al., 2004), which includes some heuristic filtering to increase phrase translation accuracy. Blocks that occur only once in the training data are included as well.</Paragraph> <Section position="1" start_page="725" end_page="726" type="sub_section"> <SectionTitle> 4.1 Practical Implementation Details </SectionTitle> <Paragraph position="0"> The training algorithm in Table 2 is adapted from the generic procedure in Table 1. It iterates 30 times over the parallel training data, each time decoding all the N = 230,000 training sentences and generating a single block translation sequence for each training sentence. The top K = 5 block sequences V_K(S_i) with the highest BLEU score are computed up-front for all training sentence pairs S_i and are stored separately, as described in Section 2. The score-based decoding of the 230,000 training sentence pairs is carried out in parallel on 25 64-bit Opteron machines. Here, the monotone decoding is much faster than the decoding with block swapping: the monotone decoding takes less than 0.5 hours and the decoding with swapping takes about an hour.
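The test-set-driven pre-filtering of the training data described at the start of this section can be sketched as follows; the maximum phrase length and the whitespace tokenization are assumptions made only for illustration.

```python
def phrases(tokens, max_len=8):
    """All contiguous sub-sequences (phrases) of up to max_len tokens."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)}

def prefilter_training_data(parallel_data, test_sentences, max_len=8):
    """Keep sentence pairs whose source side shares a phrase with the test set."""
    test_phrases = set()
    for sentence in test_sentences:
        test_phrases |= phrases(sentence.split(), max_len)
    return [(src, tgt) for src, tgt in parallel_data
            if phrases(src.split(), max_len) & test_phrases]
```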
Since the training starts with only the parallel training data and a block set, some initial block sequences have to be generated in order to initialize the global model training: for each input sentence a simple bag-of-blocks translation is generated. For each input interval that is matched by some block b, a single block is added to the bag-of-blocks translation; the order in which the blocks are generated is ignored. For this block set, only block and word identity features are generated (the unigram block and word-pair features of Section 2). This step does not require the use of a decoder. The initial block sequence training data contains only a single alternative.</Paragraph> <Paragraph position="2"> The training procedure proceeds by iteratively decoding the training data. After each decoding step, the resulting translation block sequences are stored on disk in binary format. A block sequence generated at decoding step j is used in all subsequent training steps j', where j' > j. Although for convergence with a theoretical guarantee we should use Eq. 6 to update the relevant set, in reality this idea is difficult to implement because it requires a more costly decoding step. Therefore, in Table 2 we adopt an approximation, where the relevant set is updated by adding the decoder output at each stage. In this way, we are able to treat the decoding scheme as a black box. [Table 2: the practical training procedure; for a fixed number of decoding iterations and for each input sentence S (N = number of training sentences), the decoder output is added to the relevant set, followed by SGD-based weight training.] One way to approximate Eq. 6 is to generate multiple decoding outputs and pick the most relevant points based on Eq. 6. Since N-best list generation is computationally costly, only a single block sequence is generated for each training sentence pair, which also reduces the memory requirements of the training algorithm. Although we are not able to rigorously prove a fast convergence rate for this approximation, it works well in practice, as Figure 2 shows. Theoretically, this is because points achieving large values in Eq. 6 tend to have a higher chance of becoming the top-ranked decoder output as well. The SGD-based on-line training algorithm described in Section 3 is carried out after each decoding step to generate the weight vector w for the subsequent decoding step. Since this training step is carried out on a single machine, it dominates the overall computation time. Since each iteration adds a single relevant alternative to the set V^rel(S_i), computation time increases with the number of training iterations: the initial model is trained in a few minutes, while training the model after the 30-th iteration takes up to 5 hours for the most complex models.</Paragraph> <Paragraph position="4"> Table 3 presents experimental results in terms of uncased BLEU. Two re-ordering restrictions are tested: monotone decoding ('MON'), and local block re-ordering where neighbor blocks can be swapped ('SWAP'). The 'SWAP' re-ordering uses the same features as the monotone models plus additional orientation-based and distortion-based features. Different feature sets include word-based features, phrase-based features, and the combination of both. For the results with word-based features, the decoder still generates phrase-to-phrase translations, but all the scoring is done on the word level.
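To illustrate the remark that, for the word-based models, "all the scoring is done on the word level", the sketch below scores a phrase pair purely through word-pair features: no phrase-identity or phrase-bigram feature is consulted. Distortion and target m-gram features are omitted for brevity, and the names follow the earlier illustrative sketches rather than the paper's implementation.

```python
def word_level_block_score(block, weights):
    """Score a phrase pair using only word-level (Model 1 style) features.

    The phrase pair is judged entirely by its word-to-word links; the
    decoder can still propose whole phrase pairs, but their identity as
    phrases contributes nothing to the score.
    """
    score = 0.0
    for s_word in block.src.split():
        for t_word in block.tgt.split():
            score += weights.get(("word_pair", s_word, t_word), 0.0)
    return score
```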
Line 8 shows a BLEU score of 36.3 for the best performing system, which uses all word-based and phrase-based features. [Footnote: With the computed confidence margin, the differences between three of the results in Table 3 are not statistically significant, but the other result differences are.]</Paragraph> <Paragraph position="5"> Line 1 and line 5 of Table 3 show the training data averaged BLEU score obtained by searching for the highest BLEU scoring block sequence for each training sentence pair, as described in Section 2. Allowing local block swapping in this search procedure yields a much improved BLEU score of 0.59. The experimental results show that word-based models significantly outperform phrase-based models, and that the combination of word-based and phrase-based features performs better than either feature type taken separately. Additionally, swap-based re-ordering slightly improves performance over monotone decoding. For all experiments, the training BLEU score remains significantly lower than the maximum obtainable BLEU score shown in line 1 and line 5. In this respect, there is significant room for improvement in terms of feature functions and alternative set generation. The word-based models perform surprisingly well: the model in line 7 uses only three feature types, namely Model 1 style word-pair features as described in Section 2, distortion features, and target language m-gram features up to m = 3. Training speed varies depending on the feature types used: for the simplest model, shown in line 2 of Table 3, the training takes about 12 hours; for the models using word-based features, shown in line 3 and line 7, training takes less than 2 days. Finally, the training for the most complex model, in line 8, takes about 4 days.</Paragraph> <Paragraph position="8"> [Figure 2: BLEU performance on the training data (upper graph; single reference) and on the test set (lower graph; BLEU with four references) as a function of the training iteration, for the model corresponding to line 8 in Table 3.] Figure 2 shows the BLEU performance for the model corresponding to line 8 in Table 3 as a function of the number of training iterations. By adding top-scoring alternatives in the training algorithm in Table 2, the BLEU performance on the training data improves from about 0.22 for the initial model to about 0.48 for the best model after 30 iterations. After each training iteration the test data is decoded as well. Here, the BLEU performance improves from 0.08 for the initial model to about 0.36 for the final model (we do not include the test data block sequences in the training). The curves in Figure 2 are typical for the experiments in Table 3: the training BLEU score is much higher than the test set BLEU score, despite the fact that the test set uses 4 reference translations.</Paragraph> </Section> </Section> </Paper>