<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1069"> <Title>A Localized Prediction Model for Statistical Machine Translation</Title> <Section position="3" start_page="557" end_page="561" type="metho"> <SectionTitle> 2 Block Orientation Bigrams </SectionTitle> <Paragraph position="0"> This section describes a phrase-based model for SMT similar to the models presented in (Koehn et al., 2003; Och et al., 1999; Tillmann and Xia, 2003). In our paper, phrase pairs are named blocks and our model is designed to generate block sequences. We also model the position of blocks relative to each other: this is called orientation. To define block sequences with orientation, we define the notion of block orientation bigrams. Starting point for collecting these bigrams is a block set</Paragraph> <Paragraph position="2"> sisting of a source phrase a91 and a target phrase a92 . a101 is the source phrase length anda102 is the target phrase length.</Paragraph> <Paragraph position="3"> Single source and target words are denoted by a94a4a103 and</Paragraph> <Paragraph position="5"> We will also use a special single-word block set a90 a18a96a108 a90 which contains only blocks for which a101 a72 a102 a72 a2 . For the experiments in this paper, the block set is the one used in (Al-Onaizan et al., 2004). Although this is not investigated in the present paper, different blocksets may be used for computing the block statistics introduced in this paper, which may effect translation results.</Paragraph> <Paragraph position="6"> For the block set a90 and a training sentence pair, we carry out a two-dimensional pattern matching algorithm to find adjacent matching blocks along with their position in the coordinate system defined by source and target positions (see Fig. 2). Here, we do not insist on a consistent block coverage as one would do during decoding. Among the matching blocks, two blocks a9a12a11 and a9 are adjacent if the target phrases a92 and a92a96a11 as well as the source phrases a91 and a91 a11 are adjacent. a9a12a11 is predecessor of block a9 if a9a12a11 and a9 are adjacent and a9a12a11 occurs below a9 . A right adjacent successor block a9 is said to have right orientation a10 a72a130a81 orientation. 'left' and 'right' are defined relative to thea131 axis ; 'below' is defined relative to thea132 axis. For some discussion on global re-ordering see Section 6.</Paragraph> <Paragraph position="7"> tion a10 a72a133a78 . There are matching blocks a9 that have no predecessor, such a block has neutral orientation (a10 a72a89a74 ). After matching blocks for a training sentence pair, we look for adjacent block pairs to collect block bigram orientation events a134 of the type a134</Paragraph> <Paragraph position="9"> a19a107a9a107a21 . Our model to be presented in Section 3 is used to predict a future block orientation pair a14 a9a4a19 a10 a21 given its predecessor block history a9a12a11 . In Fig. 1, the following block orientation bigrams occur: a14 a105a135a19 a74 a19a107a9 a18 a21 ,a14 a9 a18 a19 a78 a19a43a9 a76 a21 ,a14 a105a26a19 a74 a19a43a9 a79 a21 ,a14 a9 a79 a19 a81 a19a43a9 a80 a21 . Collecting orientation bigrams on all parallel sentence pairs, we obtain an orientation bigram list a134a4a136a18 :</Paragraph> <Paragraph position="11"> Here, a146 a143 is the number of orientation bigrams in the a94 -th sentence pair. 
<Paragraph position="10"> Collecting orientation bigrams on all parallel sentence pairs, we obtain an orientation bigram list $o_1^N$:

  $o_1^N \;=\; \big[\, (b', o, b)_n \,\big]_{n=1}^{N}, \qquad N = \sum_{s} N_s \qquad (2)$

Here, $N_s$ is the number of orientation bigrams in the $s$-th sentence pair. The total number $N$ of orientation bigrams comes to several million for our training data consisting of $273\,000$ sentence pairs. The orientation bigram list is used for the parameter training presented in Section 3. Ignoring the bigrams with neutral orientation $N$ reduces the list defined in Eq. 2 to a few million orientation bigrams; the neutral orientation is handled separately as described in Section 5. Using the reduced orientation bigram list, we collect unigram orientation counts $N_o(b)$: how often a block occurs with a given orientation $o \in \{L, R\}$. $N_L(b) \geq 0.25 \cdot N_R(b)$ typically holds for blocks $b$ involved in block swapping, and the orientation model is given by the relative frequency $p_o(o \mid b) = N_o(b) / (N_L(b) + N_R(b))$ for $o \in \{L, R\}$.</Paragraph> <Paragraph position="18"> In order to train a block bigram orientation model as described in Section 3.2, we define a successor set $\delta_s(b')$ for a block $b'$ in the $s$-th training sentence pair:

  $\delta_s(b') = \{\, b \mid (b', o, b) \text{ occurs in the } s\text{-th sentence pair with } o \in \{L, R\} \,\}$

The successor set $\delta_s(b')$ is defined for each event in the list $o_1^N$. The average size of $\delta_s(b')$ is $1.5$ successor blocks. If we were to compute a Viterbi block alignment for a training sentence pair, each block in this block alignment would have at most $1$ successor; blocks may have several successors because we do not enforce any kind of consistent coverage during training.</Paragraph> <Paragraph position="22"> During decoding, we generate a list of block orientation bigrams as described above. A DP-based beam-search procedure identical to the one used in (Tillmann, 2004) is used to maximize over all oriented block segmentations $(b_1^n, o_1^n)$. During decoding, orientation bigrams $(b', L, b)$ with left orientation are only generated if the successor block $b$ has been observed with left orientation in the training data, i.e. if $N_L(b)$ exceeds a small threshold.</Paragraph> <Paragraph position="23"> Each block orientation bigram is represented as a feature vector $f(b, o; b', o') \in \mathbb{R}^d$. For a model that uses all the components defined below, $d$ is $6$. As feature-vector components, we take the negative logarithm of some block model probabilities. We use the term 'float' feature for these feature-vector components (the model score is stored as a float number). Additionally, we use binary block features. The letters (a)-(f) refer to Table 1. Unigram models: we compute (a) the unigram probability and (b) the orientation probability. These probabilities are simple relative-frequency estimates based on unigram and unigram orientation counts derived from the data in Eq. 2; for details see (Tillmann, 2004). During decoding, the unigram probability is normalized by the source phrase length. Two types of trigram language model features: (c) the probability of predicting the first target word in the target clump of $b_i$ given the final two words of the target clump of $b_{i-1}$, and (d) the probability of predicting the rest of the words in the target clump of $b_i$.
The language model is trained on a separate corpus.</Paragraph> <Paragraph position="24"> Lexical weighting: (e) the lexical weight $p_w(S \mid T)$ of the block $b = (S, T)$ is computed similarly to (Koehn et al., 2003); details are given in Section 3.4.</Paragraph> <Paragraph position="26"> Binary features: (f) binary features are defined using an indicator function $f(b, b')$ which is $1$ if the block pair $(b, b')$ occurs more often than a given threshold $N$, e.g. $N = 2$:

  $f(b, b') = \begin{cases} 1 & \text{if the pair } (b', b) \text{ occurs more than } N \text{ times} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$

Here, the orientation $o$ between the blocks is ignored.</Paragraph> <Section position="1" start_page="558" end_page="558" type="sub_section"> <SectionTitle> 3.1 Global Model </SectionTitle> <Paragraph position="0"> In our linear block model, for a given source sentence $s$, each translation is represented as a sequence of block/orientation pairs $(b_1^n, o_1^n)$ consistent with the source. Using features such as those described above, we can parameterize the probability of such a sequence as $p(b_1^n, o_1^n \mid w; s)$, where $w$ is a vector of unknown model parameters to be estimated from the training data. We use a log-linear probability model and maximum-likelihood training: the parameter $w$ is estimated by maximizing the joint likelihood over all sentences. Denoting by $\Delta(s)$ the set of possible block/orientation sequences $(b_1^n, o_1^n)$ that are consistent with the source sentence $s$, the log-linear probability model can be represented as

  $p(b_1^n, o_1^n \mid w; s) = \frac{\exp\!\big(w^T f(b_1^n, o_1^n)\big)}{Z(s)} \qquad (4)$

where $f(b_1^n, o_1^n)$ denotes the feature vector of the corresponding block translation, and the partition function is $Z(s) = \sum_{(\tilde b_1^{\tilde n}, \tilde o_1^{\tilde n}) \in \Delta(s)} \exp\!\big(w^T f(\tilde b_1^{\tilde n}, \tilde o_1^{\tilde n})\big)$. A disadvantage of this approach is that the summation over $\Delta(s)$ can be rather difficult to compute. Consequently, some sophisticated approximate inference methods are needed to carry out the computation. A detailed investigation of the global model is left to another study.</Paragraph> </Section> <Section position="2" start_page="558" end_page="559" type="sub_section"> <SectionTitle> 3.2 Local Model Restrictions </SectionTitle> <Paragraph position="0"> In the following, we consider a simplification of the direct global model in Eq. 4. As in (Tillmann, 2004), we model the block bigram probability $p(b, o \mid b', o'; w, s)$ only in the context of immediate neighbors for blocks that have left or right orientation. The log-linear model is defined as

  $p(b, o \mid b', o'; w, s) = \frac{\exp\!\big(w^T f(b, o; b', o')\big)}{Z(b', o'; s)} \qquad (5)$

where $s$ is the source sentence and $f(b, o; b', o')$ is a locally defined feature vector that depends only on the current and the previous oriented blocks $(b, o)$ and $(b', o')$. The features were described at the beginning of the section. The partition function is given by

  $Z(b', o'; s) = \sum_{(\tilde b, \tilde o) \in \Delta(b', o'; s)} \exp\!\big(w^T f(\tilde b, \tilde o; b', o')\big) \qquad (6)$

The set $\Delta(b', o'; s)$ is a restricted set of possible successor oriented blocks that are consistent with the current block position and the source sentence $s$, to be described in the following paragraph.</Paragraph>
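Before turning to that restricted candidate set, here is a small, hedged sketch of how Eq. 5 and Eq. 6 might be evaluated once feature vectors are available. It assumes the five 'float' features of Table 1 plus a single binary indicator as a toy stand-in for the full binary feature set (so $d = 6$ here); all names are illustrative rather than the authors' implementation.

```python
import math
from typing import List, Sequence

def score(w: Sequence[float], f: Sequence[float]) -> float:
    """Linear score w^T f(b, o; b', o')."""
    return sum(wi * fi for wi, fi in zip(w, f))

def local_block_prob(w: Sequence[float],
                     f_true: Sequence[float],
                     f_candidates: List[Sequence[float]]) -> float:
    """p(b, o | b', o'; s) as in Eq. 5: normalized only over the restricted
    candidate set Delta(b', o'; s) of Eq. 6 (f_true must be among f_candidates)."""
    z = sum(math.exp(score(w, f)) for f in f_candidates)  # partition function, Eq. 6
    return math.exp(score(w, f_true)) / z

def feature_vector(p_unigram: float, p_orientation: float,
                   p_lm_first: float, p_lm_rest: float, p_lexical: float,
                   pair_count: int, count_threshold: int = 2) -> List[float]:
    """Toy feature vector in the spirit of Table 1: five 'float' features
    (negative log probabilities) plus one binary frequency indicator."""
    floats = [-math.log(p) for p in
              (p_unigram, p_orientation, p_lm_first, p_lm_rest, p_lexical)]
    binary = 1.0 if pair_count > count_threshold else 0.0
    return floats + [binary]
```

The same scoring routine could be reused during training and decoding; only the candidate list passed in would change.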
<Paragraph position="1"> Note that a straightforward normalization over all block orientation pairs in Eq. 5 is not feasible: there are tens of millions of possible successor blocks $b$ (if we do not impose any restriction).</Paragraph> <Paragraph position="8"> For each block $b = (S, T)$, aligned with a source sentence $s$, we define a source-induced alternative set:

  $\Gamma(b) = \{\, b'' \in \Gamma \mid b'' \text{ shares an identical source phrase with } b \,\}$

The set $\Gamma(b)$ contains the block $b$ itself, and the target phrases of the blocks in that set may differ. To restrict the number of alternatives further, the elements of $\Gamma(b)$ are sorted according to the unigram count $N(b'')$, and we keep at most the top $9$ blocks for each source interval of $s$. We also use a modified alternative set $\Gamma_1(b)$, in which the block $b$ as well as the elements of the set are single-word blocks. The partition function is computed slightly differently during training and decoding. Training: for each event $(b', o, b)$ in a sentence pair $s$ in Eq. 2, we compute the successor set $\delta_s(b')$. This defines a set of 'true' block successors. For each true successor $b$, we compute the alternative set $\Gamma(b)$. The candidate set $\Delta(b', o'; s)$ is the union of the alternative sets of these successors; the orientation $o$ of the true successor $b$ is assigned to each alternative in $\Gamma(b)$. We obtain on average $12.8$ alternatives per training event $(b', o, b)$. Decoding: here, each block $b$ that matches a source interval following $b'$ in the sentence $s$ is a potential successor, and we simply set $\Delta(b', o'; s) = \Gamma(b)$. Moreover, setting $Z(b', o'; s)$ to a constant during decoding does not change performance: the list $\Gamma(b)$ just restricts the possible target translations for a source phrase.</Paragraph> <Paragraph position="14"> Under this model, the log-probability of a possible translation of a source sentence $s$, as in Eq. 1, can be written as

  $\log p(b_1^n, o_1^n \mid w; s) \;=\; \sum_{i=1}^{n} \Big[\, w^T f(b_i, o_i; b_{i-1}, o_{i-1}) - \log Z(b_{i-1}, o_{i-1}; s) \,\Big] \qquad (7)$

In the maximum-likelihood training, we find $w$ by maximizing the sum of the log-likelihood over observed sentences, each of which has the form in Eq. 7. Although the training methodology is similar to the global formulation given in Eq. 4, this localized version is computationally much easier to manage, since the summation in the partition function $Z(b', o'; s)$ is now over a relatively small set of candidates. This computational advantage is the main reason that we adopt the local model in this paper.</Paragraph> </Section>
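The training-time construction of $\Delta(b', o'; s)$ described in this subsection might look roughly as follows. The sketch assumes the block set is indexed by source phrase and that unigram counts $N(b)$ are available; the top-$k$ cutoff mirrors the value quoted above, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Block = Tuple[str, str]  # (source phrase, target phrase)

def alternative_set(block: Block,
                    blocks_by_source: Dict[str, List[Block]],
                    unigram_count: Dict[Block, int],
                    top_k: int = 9) -> List[Block]:
    """Source-induced alternatives Gamma(b): blocks sharing b's source phrase,
    sorted by unigram count N(b''); only the top_k alternatives are kept, and
    the block itself is always included."""
    source, _ = block
    alternatives = sorted(blocks_by_source.get(source, []),
                          key=lambda b: unigram_count.get(b, 0),
                          reverse=True)[:top_k]
    if block not in alternatives:
        alternatives.append(block)
    return alternatives

def training_candidates(true_successors: List[Tuple[Block, str]],
                        blocks_by_source: Dict[str, List[Block]],
                        unigram_count: Dict[Block, int]) -> List[Tuple[Block, str]]:
    """Candidate set for one event (b', o, b): the union of the alternative
    sets of all true successors in delta_s(b'), where each alternative
    inherits the orientation o of the true successor it was generated from."""
    candidates: List[Tuple[Block, str]] = []
    for b, o in true_successors:
        for alt in alternative_set(b, blocks_by_source, unigram_count):
            candidates.append((alt, o))
    return candidates
```

Sorting by unigram count before truncation is what keeps the per-event candidate lists small (roughly a dozen alternatives on average, as reported above).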
<Section position="3" start_page="559" end_page="559" type="sub_section"> <SectionTitle> 3.3 Global versus Local Models </SectionTitle> <Paragraph position="0"> Both the global and the localized log-linear models described in this section can be considered maximum-entropy models, similar to those used in natural language processing, e.g. maximum-entropy models for POS tagging and shallow parsing. In the parsing context, global models such as the one in Eq. 4 are sometimes referred to as conditional random fields or CRFs (Lafferty et al., 2001).</Paragraph> <Paragraph position="1"> Although there are some arguments indicating that the global approach has advantages over localized models such as Eq. 5, the potential improvements are relatively small, at least in NLP applications. For SMT, the difference can potentially be more significant. This is because in our current localized model, successor blocks of different sizes are directly compared to each other, which is intuitively not the best approach (i.e., probabilities of blocks with identical lengths are more comparable). This issue is closely related to the phenomenon of multiple counting of events: a source/target sentence pair can be decomposed into different oriented block sequences in our model. In our current training procedure, we select one decomposition as the truth, while considering the other (possibly also correct) decisions as non-truth alternatives. In global modeling, with appropriate normalization, this issue becomes less severe. With this limitation in mind, the localized model proposed here is still an effective approach, as demonstrated by our experiments; moreover, it is simple both computationally and conceptually.</Paragraph> <Paragraph position="2"> Various issues such as the ones described above can be addressed with more sophisticated modeling techniques, which we leave to future studies.</Paragraph> </Section> <Section position="4" start_page="559" end_page="561" type="sub_section"> <SectionTitle> 3.4 Lexical Weighting </SectionTitle> <Paragraph position="0"> The lexical weight $p_w(S \mid T)$ of a block $b = (S, T)$ is computed similarly to (Koehn et al., 2003), but the single-word-based translation probability $p(s_j \mid t_i)$ is derived from the block set itself rather than from a word alignment, resulting in a simplified training. The single-word-based translation probability is estimated as a relative frequency, $p(s_j \mid t_i) = N(s_j, t_i) / \sum_{s} N(s, t_i)$, where $b' = (s_j, t_i)$ are single-word blocks whose source and target phrases are of length $1$, and $N(s_j, t_i)$ is the number of occurrences of the single-word block $(s_j, t_i)$. The lexical weight of a block then combines these word translation probabilities over the words of the block, as in (Koehn et al., 2003).</Paragraph>
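A minimal sketch of such a block-set-derived lexical weight is given below. The relative-frequency estimate of $p(s_j \mid t_i)$ follows the description above; the final Koehn-style combination (an averaged sum of word translation probabilities, multiplied over source words) and all names are assumptions rather than the paper's exact formula.

```python
from collections import Counter
from typing import Dict, List, Tuple

def single_word_translation_probs(single_word_blocks: List[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Relative-frequency estimate of p(s_j | t_i) from single-word blocks
    (m = n = 1), i.e. without a separate word alignment."""
    pair_count = Counter(single_word_blocks)            # N(s_j, t_i)
    target_count = Counter(t for _, t in single_word_blocks)
    return {(s, t): c / target_count[t] for (s, t), c in pair_count.items()}

def lexical_weight(source_words: List[str], target_words: List[str],
                   p: Dict[Tuple[str, str], float]) -> float:
    """A Koehn-style lexical weight for a block (S, T): for every source word,
    average the word translation probabilities over the target words of the
    block, then take the product over the source words.  (One plausible
    instantiation only; not the paper's exact normalization.)"""
    weight = 1.0
    for s in source_words:
        weight *= sum(p.get((s, t), 0.0) for t in target_words) / max(len(target_words), 1)
    return weight
```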
<Paragraph position="3"> The local model described in Section 3 leads to the following abstract maximum-entropy training formulation:

  $\hat{w} \;=\; \arg\max_{w} \sum_{i} \log \frac{\exp\!\big(w^T x_{i, y_i}\big)}{\sum_{k \in \Delta_i} \exp\!\big(w^T x_{i, k}\big)} \qquad (8)$

In this formulation, $w$ is the weight vector which we want to compute. The set $\Delta_i$ consists of candidate labels for the $i$-th training instance, with the true label $y_i \in \Delta_i$. The labels here are block identities: $\Delta_i$ corresponds to the alternative set $\Delta(b', o'; s)$, the 'true' blocks are defined by the successor set $\delta_s(b')$, and each candidate $k$ corresponds to a feature vector $x_{i,k} = f(b, o; b', o')$. This formulation is slightly different from the standard maximum-entropy formulation typically encountered in NLP applications, in that we restrict the summation to a subset $\Delta_i$ of all labels.</Paragraph> <Paragraph position="4"> Intuitively, this method favors a weight vector such that for each $i$, $w^T x_{i, y_i} - w^T x_{i, k}$ is large when $k \neq y_i$. This effect is desirable, since it tries to separate the correct classification from the incorrect alternatives. If the problem is completely separable, then it can be shown that the computed linear separator, with appropriate regularization, achieves the largest possible separating margin. The effect is similar to some multi-category generalizations of support vector machines (SVM). However, Eq. 8 is more suitable for non-separable problems (which is often the case for SMT), since it directly models the conditional probability of the candidate labels.</Paragraph> <Paragraph position="8"> A related method is the multi-category perceptron, which explicitly finds a weight vector that separates correct labels from the incorrect ones in a mistake-driven fashion (Collins, 2002). The method works by examining one sample at a time and making an update $w \leftarrow w + (x_{i, y_i} - x_{i, k})$ when $w^T (x_{i, y_i} - x_{i, k})$ is not positive. To compute the update for a training instance $i$, one usually picks the $k$ such that $w^T (x_{i, y_i} - x_{i, k})$ is smallest. It can be shown that if there exist weight vectors that separate the correct label $y_i$ from the incorrect labels $k \in \Delta_i$ for all $k \neq y_i$, then the perceptron method can find such a separator. However, it is not entirely clear what this method does when the training data are not completely separable. Moreover, the standard mistake-bound justification does not apply when we go through the training data more than once, as is typically done in practice. In spite of some issues in its justification, the perceptron algorithm is still very attractive due to its simplicity and computational efficiency, and it works quite well for a number of NLP applications.</Paragraph> <Paragraph position="9"> In the following, we show that a simple and efficient online training procedure can also be developed for the maximum-entropy formulation in Eq. 8. The proposed update rule is similar to the perceptron method but with a soft mistake-driven update rule, where the influence of each feature is weighted by the significance of its mistake. The method is essentially a version of the so-called stochastic gradient descent method, which has been widely used in complicated stochastic optimization problems such as neural networks. It was argued recently in (Zhang, 2004) that this method also works well for standard convex formulations of binary-classification problems, including SVM and logistic regression. Convergence bounds similar to perceptron mistake bounds can be developed, although, unlike the perceptron, the theory justifies the standard practice of going through the training data more than once. In the non-separable case, the method solves a regularized version of Eq. 8, which has the statistical interpretation of estimating the conditional probability. Consequently, it does not have the potential issues of the perceptron method which we pointed out earlier. Due to the nature of the online update, just like the perceptron, this method is also very simple to implement and scalable to large problem sizes. This is important in the SMT application, because we can have a huge number of training instances which we are not able to keep in memory at the same time.</Paragraph> <Paragraph position="10"> In stochastic gradient descent, we examine one training instance at a time. At the $i$-th instance, we derive the update rule by maximizing the term associated with this instance in Eq. 8. The resulting gradient step, localized to this instance, is

  $w \;\leftarrow\; w + \eta_i \Big( x_{i, y_i} - \sum_{k \in \Delta_i} \frac{\exp\!\big(w^T x_{i, k}\big)}{\sum_{k' \in \Delta_i} \exp\!\big(w^T x_{i, k'}\big)} \, x_{i, k} \Big) \qquad (9)$

Similar to online algorithms such as the perceptron, we apply this update rule to each training instance in turn (randomly ordered), and may go through the data points repeatedly. Comparing Eq. 9 to the perceptron update, there are two main differences, which we discuss below.</Paragraph>
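Before discussing those differences, the following sketch contrasts the two updates on a single training instance: the soft update is the stochastic-gradient step of Eq. 9, and the hard update is the multi-category perceptron step described above. The learning-rate default and all names are illustrative assumptions, not the authors' code.

```python
import math
from typing import List, Sequence

def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sgd_update(w: List[float], candidates: List[Sequence[float]],
               true_index: int, eta: float = 1e-3) -> None:
    """Soft update for one instance (Eq. 9):
    w <- w + eta * (x_true - sum_k p_k * x_k),
    i.e. a gradient step on log p(true | candidates) under the restricted softmax."""
    probs = softmax([sum(wi * xi for wi, xi in zip(w, x)) for x in candidates])
    x_true = candidates[true_index]
    for j in range(len(w)):
        expected_j = sum(p * x[j] for p, x in zip(probs, candidates))
        w[j] += eta * (x_true[j] - expected_j)

def perceptron_update(w: List[float], candidates: List[Sequence[float]],
                      true_index: int) -> None:
    """Hard multi-category perceptron update: move toward the true candidate
    and away from its strongest rival, only when the margin is not positive."""
    if len(candidates) < 2:
        return
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in candidates]
    rival = max((k for k in range(len(candidates)) if k != true_index),
                key=lambda k: scores[k])
    if scores[true_index] - scores[rival] <= 0:
        for j in range(len(w)):
            w[j] += candidates[true_index][j] - candidates[rival][j]
```

In the soft update every candidate contributes in proportion to its conditional probability, whereas the perceptron moves weight only between the true candidate and its strongest rival, and only when the margin is violated.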
<Paragraph position="11"> The first difference is the weighting scheme. Instead of putting all the update weight on a single (most mistaken) candidate, as in the perceptron algorithm, we use a soft-weighting scheme in which each candidate $k$ is weighted by the factor $\exp(w^T x_{i,k}) / \sum_{k' \in \Delta_i} \exp(w^T x_{i,k'})$, so that candidates with larger $w^T x_{i,k}$ get more weight. This effect is in principle similar to the perceptron update. The smoothing effect in Eq. 9 is useful for non-separable problems, since it does not force an update rule that attempts to separate the data: each candidate gets a weight that is proportional to its conditional probability.</Paragraph> <Paragraph position="12"> The second difference is the introduction of a learning-rate parameter $\eta_i$. For the algorithm to converge, one should pick a decreasing learning rate. In practice, however, it is often more convenient to select a fixed $\eta_i = \eta$ for all $i$. This leads to an algorithm that approximately solves a regularized version of Eq. 8. If we go through the data repeatedly, one may also decrease the fixed learning rate by monitoring the progress made each time we go through the data. For practical purposes, a fixed small $\eta$ is usually sufficient. We typically run forty updates over the training data. Using techniques similar to those of (Zhang, 2004), we can obtain a convergence theorem for our algorithm; due to space limitations, we do not present the analysis here.</Paragraph> <Paragraph position="13"> An advantage of this method over standard maximum-entropy training such as GIS (generalized iterative scaling) is that it does not require us to store all the data in memory at once. Moreover, the convergence analysis can be used to show that if the number of training instances is large, we can get a very good approximate solution by going through the data only once. This desirable property implies that the method is particularly suitable for large-scale problems.</Paragraph> </Section> </Section> <Section position="4" start_page="561" end_page="562" type="metho"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"> The translation system is tested on an Arabic-to-English translation task. The training data comes from UN news sources. Some punctuation tokenization and some number classing are carried out on the English and the Arabic training data. In this paper, we present results for two test sets: (1) the devtest set uses data provided by LDC and consists of 1043 sentences with 25,889 Arabic words and 4 reference translations; (2) the blind test set is the MT03 Arabic-English DARPA evaluation test set, consisting of 663 sentences with 16,278 Arabic words, also with 4 reference translations. Experimental results are reported in Table 2: cased BLEU results (Papineni et al., 2002) are reported on the MT03 Arabic-English test set.
The word casing is added as a post-processing step using a statistical model (details are omitted here).</Paragraph> <Paragraph position="1"> In order to speed up the parameter training, we filter the original training data according to the two test sets: for each test set we take all the Arabic substrings up to length 12 and filter the parallel training data to include only those training sentence pairs that contain at least one of these phrases. The 'LDC' training data contains about 273 thousand sentence pairs and the 'MT03' training data contains about 230 thousand sentence pairs. Two block sets are derived for each of the training sets using a phrase-pair selection algorithm similar to (Koehn et al., 2003; Tillmann and Xia, 2003). These block sets also include blocks that occur only once in the training data. Additionally, some heuristic filtering is used to increase phrase translation accuracy (Al-Onaizan et al., 2004).</Paragraph> <Section position="1" start_page="561" end_page="561" type="sub_section"> <SectionTitle> 5.1 Likelihood Training Results </SectionTitle> <Paragraph position="0"> We compare model performance with respect to the number and type of features used as well as with respect to different re-ordering models. Results for 9 experiments are shown in Table 2, where the feature types are described in Table 1. The first 5 experimental results are obtained by carrying out the likelihood training described in Section 3. Line 1 in Table 2 shows the performance of the baseline block unigram 'MON' model, which uses two 'float' features: the unigram probability and the boundary-word language model probability. No block re-ordering is allowed for the baseline model (a monotone block sequence is generated). The 'SWAP' model in line 2 uses the same two features, but neighbor blocks can be swapped; no performance increase is obtained for this model. The 'SWAP & OR' model uses an orientation model as described in Section 3; here, we obtain a small but significant improvement over the baseline model. Line 4 shows that including two additional 'float' features, the lexical weighting and the language model probability of predicting the second and subsequent words of the target clump, yields a further significant improvement. Line 5 shows that including binary features and training their weights on the training data actually decreases performance; this issue is addressed in Section 5.2.</Paragraph> <Paragraph position="1"> The training is carried out as follows. The results in lines 1-4 are obtained by training the 'float' weights only; here, the training is carried out by running only once over 10% of the training data. The model including the binary features is trained on the entire training data. We obtain about 3.37 million features of the type defined in Eq. 3 by setting the threshold to 3. Forty iterations over the training data take about 2 hours on a single Intel machine. Although the online algorithm does not require us to do so, our training procedure keeps the entire training data and the weight vector $w$ in about 2 gigabytes of memory. For blocks with neutral orientation $o = N$, we train a separate model that does not use the orientation model feature or the binary features; for the results reported here, this separate model is trained on the neutral-orientation bigram subsequence that is part of Eq. 2.</Paragraph> </Section>
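The test-set-specific filtering described at the beginning of this section can be sketched as follows; names are hypothetical, and the maximum substring length of 12 follows the text above.

```python
from typing import Iterable, List, Set, Tuple

def source_substrings(sentences: Iterable[str], max_len: int = 12) -> Set[Tuple[str, ...]]:
    """All source-side word n-grams of a test set, up to length max_len."""
    substrings: Set[Tuple[str, ...]] = set()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                substrings.add(tuple(words[i:j]))
    return substrings

def filter_parallel_data(pairs: Iterable[Tuple[str, str]],
                         test_substrings: Set[Tuple[str, ...]],
                         max_len: int = 12) -> List[Tuple[str, str]]:
    """Keep only sentence pairs whose source side contains at least one of the
    collected test-set substrings."""
    kept: List[Tuple[str, str]] = []
    for source, target in pairs:
        words = source.split()
        ngrams = (tuple(words[i:j])
                  for i in range(len(words))
                  for j in range(i + 1, min(i + max_len, len(words)) + 1))
        if any(g in test_substrings for g in ngrams):
            kept.append((source, target))
    return kept
```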
<Section position="2" start_page="561" end_page="562" type="sub_section"> <SectionTitle> 5.2 Modified Weight Training </SectionTitle> <Paragraph position="0"> We implemented the following variation of the likelihood training procedure described in Section 3, in which we make use of the 'LDC' devtest set. First, we train a model on the 'LDC' training data using the 5 float features and the binary features. We use this model to decode the devtest 'LDC' set. (Table 2 caption: cased BLEU results with confidence intervals on the MT03 test data; the third column summarizes the model variations; the results in lines 8 and 9 are for a cheating experiment in which the float weights are trained on the test data itself.) During decoding, we generate a 'translation graph' for every input sentence using a procedure similar to (Ueffing et al., 2002): a translation graph is a compact way of representing candidate translations which are close in terms of likelihood. From the translation graph, we obtain the 1000 best translations according to the translation score. Out of this list, we find the block sequence that generated the top BLEU-scoring target translation. Computing the top BLEU-scoring block sequence for all input sentences, we obtain a modified orientation bigram list

  $\tilde o_1^{N'} = \big[\, (b', o, b)_n \,\big]_{n=1}^{N'} \qquad (10)$

where $N' \approx 9400$. Here, $N'$ is the number of blocks needed to decode the entire devtest set. Alternatives for each of the events in $\tilde o_1^{N'}$ are generated as described in Section 3.2; the set of alternatives is further restricted to those blocks that occur in some translation in the 1000-best list. The 5 float weights are trained on the modified training data in Eq. 10, where the training takes only a few seconds. We then decode the 'MT03' test set using the modified 'float' weights. As shown in lines 4 and 6, there is almost no change in performance between training on the original training data in Eq. 2 and on the modified training data in Eq. 10. Line 8 shows that even when training the float weights on an event set obtained from the test data itself in a cheating experiment, we obtain only a moderate performance improvement, from 37.7 to 39.0. For the experimental results in lines 7 and 9, we use the same five float weights as trained for the experiments in lines 6 and 8 and keep them fixed while training the binary feature weights only. Using the binary features leads to only a minor improvement in BLEU, from 37.8 to 38.2 in line 7. For this best model, we obtain an 18.6% BLEU improvement over the baseline.</Paragraph> <Paragraph position="2"> From our experimental results, we draw the following conclusions: (1) the translation performance is largely dominated by the 'float' features; (2) using the same set of 'float' features, the performance does not change much when training on training, devtest, or even test data. Although we do not obtain a significant improvement from the use of binary features, we currently consider binary features a promising approach, for the following reasons: • The current training does not take into account the block interaction on the sentence level.
A more accurate approximation of the global model, as discussed in Section 3.1, might improve performance.</Paragraph> <Paragraph position="4"> • As described in Section 3.2 and Section 5.2, for efficiency reasons alternatives are computed from source phrase matches only. During training, more accurate local approximations of the partition function in Eq. 6 can be obtained by looking at block translations in the context of translation sequences. This involves the computationally expensive generation of a translation graph for each training sentence pair; this is future work.</Paragraph> <Paragraph position="5"> • As mentioned in Section 1, viewing the translation process as a sequence of local decisions makes it similar to other NLP problems such as POS tagging, phrase chunking, and statistical parsing. This similarity may facilitate the incorporation of these approaches into our translation model.</Paragraph> </Section> </Section> </Paper>