Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 721–728,
Sydney, July 2006. ©2006 Association for Computational Linguistics
A Discriminative Global Training Algorithm for Statistical MT
Christoph Tillmann
IBM T.J. Watson Research Center
Yorktown Heights, N.Y. 10598
ctill@us.ibm.com
Tong Zhang
Yahoo! Research
New York City, N.Y. 10011
tzhang@yahoo-inc.com
Abstract
This paper presents a novel training al-
gorithm for a linearly-scored block se-
quence translation model. The key com-
ponent is a new procedure to directly op-
timize the global scoring function used by
an SMT decoder. No translation, language,
or distortion model probabilities are used
as in earlier work on SMT. Therefore
our method, which employs less domain
specific knowledge, is both simpler and
more extensible than previous approaches.
Moreover, the training procedure treats the
decoder as a black-box, and thus can be
used to optimize any decoding scheme.
The training algorithm is evaluated on a
standard Arabic-English translation task.
1 Introduction
This paper presents a view of phrase-based SMT
as a sequential process that generates block ori-
entation sequences. A block is a pair of phrases
which are translations of each other. For example,
Figure 1 shows an Arabic-English translation ex-
ample that uses four blocks. During decoding, we
view translation as a block segmentation process,
where the input sentence is segmented from left
to right and the target sentence is generated from
bottom to top, one block at a time. A monotone block sequence is generated, except for the possibility of some local phrase re-ordering. In this local re-ordering model (Tillmann and Zhang, 2005; Kumar and Byrne, 2005) a block $b$ with orientation $o$ is generated relative to its predecessor block $b'$. During decoding, we maximize the score $s_w(b_1^n, o_1^n)$ of a block orientation sequence $(b_1^n, o_1^n)$:
Figure 1: An Arabic-English block translation example with four blocks $b_1, \ldots, b_4$, where the Arabic words are romanized. The following orientation sequence is generated: $o_1 = N$, $o_2 = L$, $o_3 = N$, $o_4 = R$.

$$ s_w(b_1^n, o_1^n) \;=\; \sum_{i=1}^{n} w^T f(b_i, o_i, b_{i-1}), \qquad (1) $$
where $b_i$ is a block, $b_{i-1}$ is its predecessor block, and $o_i \in \{L(\text{eft}), R(\text{ight}), N(\text{eutral})\}$ is a three-valued orientation component linked to the block $b_i$: a block is generated to the left or the right of its predecessor block $b_{i-1}$, where the orientation $o_{i-1}$ of the predecessor block is ignored. Here, $n$ is the number of blocks in the translation. We are interested in learning the weight vector $w$ from the training data. $f(b_i, o_i, b_{i-1})$ is a high-dimensional binary feature representation of the block orientation pair $(b_i, o_i, b_{i-1})$. The block orientation sequence is generated under the restriction that the concatenated source phrases of the blocks $b_i$ yield
the input sentence. In modeling a block sequence,
we emphasize adjacent block neighbors that have
right or left orientation, since in the current exper-
iments only local block swapping is handled (neu-
tral orientation is used for ’detached’ blocks as de-
scribed in (Tillmann and Zhang, 2005)).
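To make Eq. 1 concrete, the following sketch (ours, not part of the original system; the feature templates below are illustrative stand-ins for the binary features defined in Section 2) scores a block orientation sequence as a sum of sparse feature weights:

```python
# Minimal sketch of the linear block-sequence score of Eq. 1.
# A block is a (source phrase, target phrase) pair; an orientation is "L", "R", or "N".
from typing import Dict, List, Set, Tuple

Block = Tuple[str, str]  # (source phrase, target phrase)

def block_bigram_features(b: Block, o: str, b_prev: Block) -> Set[str]:
    """Names of binary features firing for one block bigram (b_prev-dependent
    features are omitted in this simplified stand-in)."""
    src, tgt = b
    feats = {f"unigram:{src}|{tgt}", f"orientation:{o}"}
    feats.update(f"word_pair:{s}|{t}" for s in src.split() for t in tgt.split())
    return feats

def sequence_score(w: Dict[str, float],
                   blocks: List[Block],
                   orientations: List[str]) -> float:
    """s_w(b_1^n, o_1^n) = sum_i w^T f(b_i, o_i, b_{i-1})."""
    score, b_prev = 0.0, ("<s>", "<s>")
    for b, o in zip(blocks, orientations):
        score += sum(w.get(name, 0.0) for name in block_bigram_features(b, o, b_prev))
        b_prev = b
    return score
```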
This paper focuses on the discriminative train-
ing of the weight vector
$w$
used in Eq. 1. The de-
coding process is decomposed into local decision
steps based on Eq. 1, but the model is trained in
a global setting as shown below. The advantage
of this approach is that it can easily handle tens of
millions of features, e.g. up to 35 million features
for the experiments in this paper. Moreover, under
this view, SMT becomes quite similar to sequen-
tial natural language annotation problems such as
part-of-speech tagging and shallow parsing, and
the novel training algorithm presented in this pa-
per is actually most similar to work on training al-
gorithms presented for these tasks, e.g. the on-line
training algorithm presented in (McDonald et al.,
2005) and the perceptron training algorithm pre-
sented in (Collins, 2002). The current approach
does not use specialized probability features as in
(Och, 2003) in any stage during decoder parame-
ter training. Such probability features include lan-
guage model, translation or distortion probabili-
ties, which are commonly used in current SMT
approaches 1. We are able to achieve comparable
performance to (Tillmann and Zhang, 2005). The
novel algorithm differs computationally from ear-
lier work in discriminative training algorithms for
SMT (Och, 2003) as follows:
- No computationally expensive $N$-best lists are generated during training: for each input sentence a single block sequence is generated on each iteration over the training data.

- No additional development data set is necessary, as the weight vector $w$ is trained on bilingual training data only.
The paper is structured as follows: Section 2
presents the baseline block sequence model and
the feature representation. Section 3 presents
the discriminative training algorithm that learns
1A translation and distortion model is used in generating
the block set used in the experiments, but these translation
probabilities are not used during decoding.
a good global ranking function used during de-
coding. Section 4 presents results on a standard
Arabic-English translation task. Finally, some dis-
cussion and future work is presented in Section 5.
2 Block Sequence Model
This paper views phrase-based SMT as a block
sequence generation process. Blocks are phrase
pairs consisting of target and source phrases and
local phrase re-ordering is handled by including
so-called block orientation. The starting point for the block-based translation model is a block set, e.g. about 9.5 million Arabic-English phrase pairs for the experiments in this paper. This block set is used to decode the training sentences to obtain block orientation sequences that are used in the discriminative parameter training. Nothing but the block set and the parallel training data is used to carry out the training. We use the block set described in (Al-Onaizan et al., 2004); the use of a different block set may affect translation results.
Rather than predicting local block neighbors as in (Tillmann and Zhang, 2005), here the model parameters are trained in a global setting. Starting with a simple model, the training data is decoded multiple times: the weight vector $w$ is trained to
discriminate block sequences with a high trans-
lation score against block sequences with a high
BLEU score 2. The high BLEU scoring block
sequences are obtained as follows: the regular
phrase-based decoder is modified in a way that
it uses the BLEU score as the optimization criterion
(independent of any translation model). Here,
searching for the highest BLEU scoring block se-
quence is restricted to local re-ordering as is the
model-based decoding (as shown in Fig. 1). The
BLEU score is computed with respect to the sin-
gle reference translation provided by the paral-
lel training data. A block sequence with an av-
erage BLEU score of about 0.54 is obtained for
each training sentence 3. The ’true’ maximum
BLEU block sequence as well as the high scoring
2High scoring block sequences may contain translation er-
rors that are quantified by a lower BLEU score.
3The training BLEU score is computed for each train-
ing sentence pair separately (treating each sentence pair as
a single-sentence corpus with a single reference) and then av-
eraged over all training sentences. Although block sequences
are found with a high BLEU score on average there is no
guarantee to find the maximum BLEU block sequence for a
given sentence pair. The target word sequence correspond-
ing to a block sequence does not have to match the refer-
ence translation, i.e. maximum BLEU scores are quite low
for some training sentences.
block sequences are represented by high dimen-
sional feature vectors using the binary features de-
fined below and the translation process is handled
as a multi-class classification problem in which
each block sequence represents a possible class.
The effect of this training procedure can be seen
in Figure 2: each decoding step on the training
data adds a high-scoring block sequence to the dis-
criminative training, and decoding performance on
the training data is improved after each iteration
(along with the test data decoding performance).
A theoretical justification for the novel training
procedure is given in Section 3.
We now define the feature components for the
block bigram feature vector $f(b_i, o_i, b_{i-1})$ in Eq. 1.
Although the training algorithm can handle real-
valued features as used in (Och, 2003; Tillmann
and Zhang, 2005), the current paper intentionally
excludes them. The current feature functions are
similar to those used in common phrase-based
translation systems: for them it has been shown
that good translation performance can be achieved
4. A systematic analysis of the novel training algo-
rithm will allow us to include much more sophis-
ticated features in future experiments, e.g. POS-
based features, syntactic or hierarchical features
(Chiang, 2005). The dimensionality of the fea-
ture vector $f(b_i, o_i, b_{i-1})$ depends on the number of binary features. For illustration purposes, the binary features are chosen such that they yield $1$ on the example block sequence in Fig. 1. There are phrase-based and word-based features:

$$ f_{1000}(b_i, o_i, b_{i-1}) = \begin{cases} 1 & \text{block } b_i \text{ consists of target phrase 'violate' and source phrase 'tnthk'} \\ 0 & \text{otherwise} \end{cases} $$

$$ f_{1001}(b_i, o_i, b_{i-1}) = \begin{cases} 1 & \text{'Lebanese' is a word in the target phrase of block } b_i \text{ and 'AllbnAny' is a word in the source phrase} \\ 0 & \text{otherwise} \end{cases} $$
The feature $f_{1000}$ is a 'unigram' phrase-based feature capturing the identity of a block. Additional phrase-based features include block orientation, target and source phrase bigram features. Word-based features are used as well, e.g. feature $f_{1001}$ captures word-to-word translation dependencies similar to the use of Model 1 probabilities in (Koehn et al., 2003). Additionally, we use distortion features involving relative source word position and $n$-gram features for adjacent target words. These features correspond to the use of a language model, but the weights for these features are trained on the parallel training data only. For the most complex model, the number of features is about 35 million (ignoring all features that occur only once).

4On our test set, (Tillmann and Zhang, 2005) reports a BLEU score of 37.8 and (Ittycheriah and Roukos, 2005) reports a BLEU score of 48.0.
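As an illustration of the word-based feature types described above (word-pair, distortion, and target $n$-gram features), the following sketch uses our own simplified templates; the paper's actual feature definitions may differ:

```python
# Hypothetical word-based feature templates for one block (our own naming).
from typing import List, Set

def word_based_features(src_words: List[str], tgt_words: List[str],
                        src_start: int, src_sentence_len: int,
                        tgt_history: List[str], n: int = 3) -> Set[str]:
    feats = set()
    # word-to-word features, similar in spirit to Model 1 probabilities
    feats.update(f"w2w:{s}|{t}" for s in src_words for t in tgt_words)
    # distortion feature from the relative source word position
    feats.add(f"dist:{round(src_start / max(1, src_sentence_len), 1)}")
    # target-language n-gram features over adjacent target words
    context = tgt_history + tgt_words
    for k in range(2, n + 1):
        for i in range(len(context) - k + 1):
            feats.add("ngram:" + "_".join(context[i:i + k]))
    return feats
```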
3 Approximate Relevant Set Method
Throughout the section, we let $z = (b_1^n, o_1^n)$. Each block sequence $z = (b_1^n, o_1^n)$ corresponds to a candidate translation. In the training data where target translations are given, a BLEU score $Bl(z)$ can be calculated for each $z = (b_1^n, o_1^n)$ against the target translations. In this setup, our goal is to find a weight vector $w$ such that the higher $s_w(z)$ is, the higher the corresponding BLEU score $Bl(z)$ should be. If we can find such a weight vector, then block decoding by searching for the highest $s_w(z)$ will lead to good translations with high BLEU scores.
Formally, we denote a source sentence by $S$, and let $V(S)$ be the set of possible candidate oriented block sequences $z = (b_1^n, o_1^n)$ that the decoder can generate from $S$. For example, in a monotone decoder, the set $V(S)$ contains block sequences $\{b_1^n\}$ that cover the source sentence $S$ in the same order. For a decoder with local re-ordering, the candidate set $V(S)$ also includes additional block sequences with re-ordered block configurations that the decoder can efficiently search. Therefore, depending on the specific implementation of the decoder, the set $V(S)$ can be different. In general, $V(S)$ is a subset of all possible oriented block sequences $\{(b_1^n, o_1^n)\}$ that are consistent with the input sentence $S$.
Given a scoring function $s_w(\cdot)$ and an input sentence $S$, we can assume that the decoder implements the following decoding rule:

$$ \hat{z}(S) = \arg\max_{z \in V(S)} \; s_w(z). \qquad (2) $$
Let $S_1, \ldots, S_N$ be a set of $N$ training sentences. Each sentence $S_i$ is associated with a set $V(S_i)$ of possible translation block sequences that are searchable by the decoder. Each translation block sequence $z \in V(S_i)$ induces a translation, which is then assigned a BLEU score $Bl(z)$ (obtained by comparing against the target translations). The goal of the training is to find a weight vector $w$ such that for each training sentence $S_i$, the corresponding decoder output $\hat{z} \in V(S_i)$ has the maximum BLEU score among all $z \in V(S_i)$ based on Eq. 2. In other words, if $\hat{z}$ maximizes the scoring function $s_w(z)$, then $\hat{z}$ also maximizes the BLEU metric.
Based on the description, a simple idea is to learn the BLEU score $Bl(z)$ for each candidate block sequence $z$. That is, we would like to estimate $w$ such that $s_w(z) \approx Bl(z)$. This can be achieved through least squares regression. It is easy to see that if we can find a weight vector $w$ that approximates $Bl(z)$, then the decoding rule in Eq. 2 automatically maximizes the BLEU score. However, it is usually difficult to estimate $Bl(z)$ reliably based only on a linear combination of the feature vector as in Eq. 1. We note that a good decoder does not necessarily employ a scoring function that approximates the BLEU score. Instead, we only need to make sure that the top-ranked block sequence obtained by the decoder scoring function has a high BLEU score. To formulate this idea, we attempt to find a decoding parameter such that for each sentence $S$ in the training data, sequences in $V(S)$ with the highest BLEU scores should get $s_w(z)$ scores higher than those with low BLEU scores.
Denote by $V_K(S)$ a set of $K$ block sequences in $V(S)$ with the highest BLEU scores. Our decoded result should lie in this set. We call them the "truth". The set of the remaining sequences is $V(S) - V_K(S)$, which we shall refer to as the "alternatives". We look for a weight vector $w$ that minimizes the following training criterion:

$$ \hat{w} = \arg\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} \Phi\big(w, V_K(S_i), V(S_i)\big) \;+\; \lambda \|w\|^2 \qquad (3) $$

$$ \Phi(w, V_K, V) = \frac{1}{K} \sum_{z \in V_K} \; \max_{z' \in V - V_K} \phi(w, z, z') $$

$$ \phi(w, z, z') = \psi\big(s_w(z), Bl(z); \; s_w(z'), Bl(z')\big), $$

where $\psi$ is a non-negative real-valued loss function (whose specific choice is not critical for the purposes of this paper), and $\lambda \geq 0$ is a regularization parameter. In our experiments, results are obtained using the following convex loss
$$ \psi(s, b; s', b') = (b - b')\,\big(1 - (s - s')\big)_+^2\,, \qquad (4) $$

where $b, b'$ are BLEU scores, $s, s'$ are translation scores, and $(x)_+ = \max(0, x)$. We refer to this formulation as the 'costMargin' (cost-sensitive margin) method: for each training sentence $S$, the cost-weighted margin between the 'true' block sequence set $V_K(S)$ and the 'alternative' block sequence set $V(S)$ is maximized, i.e. the 'costMargin' loss $\Phi(w, V_K(S), V(S))$ is minimized. Note that due to the truth and alternative setup, we always have $b \geq b'$. This loss function gives an upper bound on the error we will suffer if the order of $s$ and $s'$ is wrongly predicted (that is, if we predict $s \leq s'$ instead of $s > s'$). It also has the property that if $b \approx b'$ holds for the BLEU scores, then the loss value is small (proportional to $b - b'$).
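A minimal sketch of the loss in Eq. 4 and of the per-sentence costMargin term $\Phi$ of Eq. 3 (regularization omitted; sequences are reduced here to their score/BLEU pairs, and all names are ours, not the paper's):

```python
# Sketch of the costMargin loss (Eq. 4) and the per-sentence criterion Phi (Eq. 3).
from typing import List, Tuple

def cost_margin_loss(s: float, b: float, s_alt: float, b_alt: float) -> float:
    """psi(s, b; s', b') = (b - b') * (1 - (s - s'))_+^2"""
    return (b - b_alt) * max(0.0, 1.0 - (s - s_alt)) ** 2

def sentence_phi(truth: List[Tuple[float, float]],
                 alternatives: List[Tuple[float, float]]) -> float:
    """Phi = (1/K) * sum over truth of the worst-case loss over the alternatives.
    Each element of `truth` / `alternatives` is a (score, bleu) pair."""
    return sum(
        max(cost_margin_loss(s, b, s_a, b_a) for (s_a, b_a) in alternatives)
        for (s, b) in truth
    ) / len(truth)
```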
A major contribution of this work is a proce-
dure to solve Eq. 3 approximately. The main dif-
ficulty is that the search space $V(S)$ covered by the decoder can be extremely large. It cannot be enumerated for practical purposes. Our idea is to replace this large space by a small subspace $V_{rel}(S) \subset V(S)$, which we call the relevant set. The
possibility of this reduction is based on the follow-
ing theoretical result.
Lemma 1 Let $\phi(w, z, z')$ be a non-negative continuous piece-wise differentiable function of $w$, and let $\hat{w}$ be a local solution of Eq. 3. Let

$$ \zeta_i(w, z) = \max_{z' \in V(S_i) - V_K(S_i)} \phi(w, z, z'), $$

and define

$$ V_{rel}(S_i) = \big\{\, z' \in V(S_i) : \exists\, z \in V_K(S_i) \text{ s.t. } \zeta_i(\hat{w}, z) \neq 0 \text{ and } \phi(\hat{w}, z, z') = \zeta_i(\hat{w}, z) \,\big\}. $$

Then $\hat{w}$ is a local solution of

$$ \min_{w} \; \frac{1}{N} \sum_{i=1}^{N} \Phi\big(w, V_K(S_i), V_{rel}(S_i)\big) \;+\; \lambda \|w\|^2. \qquad (5) $$

If $\psi$ is a convex function of $w$ (as in our choice), then we know that the global optimal solution remains the same if the whole decoding space $V$ is replaced by the relevant set $V_{rel}$.
Each subspace $V_{rel}(S_i)$ will be significantly smaller than $V(S_i)$. This is because it only includes those alternatives $z'$ with score $s_{\hat{w}}(z')$ close
to one of the selected truth. These are the most im-
portant alternatives that are easily confused with
the truth. Essentially the lemma says that if the
decoder works well on these difficult alternatives
(relevant points), then it works well on the whole
space. The idea is closely related to active learn-
ing in standard classification problems, where we
selectively pick the most important samples (often based on estimation uncertainty) for labeling in order to maximize classification performance (Lewis and Catlett, 1994). In the active learning setting, as long as we do well on the actively selected samples, we do well on the whole sample space. In our case, as long as we do well on the relevant set, the decoder will perform well.

Table 1: Generic Approximate Relevant Set Method

for each data point $S$
    initialize truth $V_K(S)$ and alternative $V_{rel}(S)$
for each decoding iteration $\ell$: $\ell = 1, \ldots, L$
    for each data point $S$
        select relevant points $\{\tilde{z}_k\} \subset V(S)$  (*)
        update $V_{rel}(S) \leftarrow V_{rel}(S) \cup \{\tilde{z}_k\}$
    update $w$ by solving Eq. 5 approximately  (**)
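The loop of Table 1 can be sketched as follows, with the (*) and (**) steps left as abstract callbacks, mirroring the fact that the implementation details are intentionally left open; the helper names are hypothetical:

```python
# Sketch of the generic approximate relevant set loop of Table 1.
# select_relevant_points(w, s) implements step (*); solve_eq5(...) step (**).
def approximate_relevant_set(sentences, truth_sets, select_relevant_points,
                             solve_eq5, init_w, num_iterations):
    # V_rel(S_i) for each training sentence, indexed like `sentences`
    relevant = [set() for _ in sentences]
    w = init_w
    for _ in range(num_iterations):                            # decoding iterations
        for i, s in enumerate(sentences):
            relevant[i] |= set(select_relevant_points(w, s))   # step (*)
        w = solve_eq5(w, truth_sets, relevant)                 # step (**)
    return w
```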
Since the relevant set depends on the decoder parameter $w$, and the decoder parameter is optimized on the relevant set, it is necessary to estimate them jointly using an iterative algorithm. The basic idea is to start with a decoding parameter $w$, and estimate the corresponding relevant set; we then update $w$ based on the relevant set, and iterate this process. The procedure is outlined in Table 1. We intentionally leave the implementation details of the (*) step and (**) step open. Moreover, in this general algorithm, we do not have to assume that $s_w(z)$ has the form of Eq. 1.
A natural question concerning the procedure is
its convergence behavior. It can be shown that un-
der mild assumptions, if we pick in (*) an alterna-
tive $\tilde{z}_k \in V(S) - V_K(S)$ for each $z_k \in V_K(S)$ ($k = 1, \ldots, K$) such that

$$ \phi(w, z_k, \tilde{z}_k) = \max_{z' \in V(S) - V_K(S)} \phi(w, z_k, z'), \qquad (6) $$

then the procedure converges to the solution of Eq. 3. Moreover, the rate of convergence depends only on the property of the loss function, and not on the size of $V(S)$. This property is critical as it shows that as long as Eq. 6 can be computed efficiently, then the Approximate Relevant Set algorithm is efficient. Moreover, it gives a bound on the size of an approximate relevant set with a certain accuracy.5
5Due to the space limitation, we will not include a formal statement here. A detailed theoretical investigation of the method will be given in a journal paper.
The approximate solution of Eq. 5 in (**) can be implemented using stochastic gradient descent (SGD), where we may simply update $w$ as:

$$ w \leftarrow w - \eta \, \nabla_w \, \phi(w, z_k, \tilde{z}_k). $$

The parameter $\eta > 0$ is a fixed constant often referred to as the learning rate. Again, convergence results can be proved for this procedure. Due to the space limitation, we skip the formal statement as well as the corresponding analysis.
Up to this point, we have not assumed any specific form of the decoder scoring function in our algorithm. Now consider Eq. 1 used in our model. We may express it as:

$$ s_w(z) = w^T \Phi(z), $$

where $\Phi(z) = \sum_{i=1}^{n} f(b_i, o_i, b_{i-1})$. Using this feature representation and the loss function in Eq. 4, we obtain the following costMargin SGD update rule for each training data point and $k$:

$$ w \leftarrow w + \eta \, \Delta Bl_k \, \Delta z_k \, \big(1 - w^T \Delta z_k\big)_+ \,, \qquad (7) $$

$$ \Delta Bl_k = Bl(z_k) - Bl(\tilde{z}_k), \qquad \Delta z_k = \Phi(z_k) - \Phi(\tilde{z}_k). $$
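The update of Eq. 7 can be sketched as follows for a single (truth, alternative) pair, assuming sparse dictionaries for the aggregated feature vectors $\Phi(z)$; the function and variable names are ours, not the original implementation's:

```python
# Sketch of the costMargin SGD update of Eq. 7 (in place; a factor from the
# squared hinge is assumed to be absorbed into the learning rate eta).
from typing import Dict

def sgd_update(w: Dict[str, float], eta: float,
               phi_truth: Dict[str, float], bleu_truth: float,
               phi_alt: Dict[str, float], bleu_alt: float) -> None:
    """w <- w + eta * dBl_k * dz_k * (1 - w^T dz_k)_+"""
    dz = {f: phi_truth.get(f, 0.0) - phi_alt.get(f, 0.0)
          for f in set(phi_truth) | set(phi_alt)}
    margin = max(0.0, 1.0 - sum(w.get(f, 0.0) * v for f, v in dz.items()))
    if margin > 0.0:
        d_bl = bleu_truth - bleu_alt
        for f, v in dz.items():
            w[f] = w.get(f, 0.0) + eta * d_bl * v * margin
```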
4 Experimental Results
We applied the novel discriminative training ap-
proach to a standard Arabic-to-English translation
task. The training data comes from UN news
sources. Some punctuation tokenization and some
number classing are carried out on the English
and the Arabic training data. We show transla-
tion results in terms of the automatic BLEU evalu-
ation metric (Papineni et al., 2002) on the MT03
Arabic-English DARPA evaluation test set con-
sisting of 663 sentences with 16 278 Arabic words and 4 reference translations. In order to speed
up the parameter training the original training data
is filtered according to the test set: all the Ara-
bic substrings that occur in the test set are com-
puted and the parallel training data is filtered to
include only those training sentence pairs that con-
tain at least one out of these phrases: the resulting
pre-filtered training data contains about 230 thousand sentence pairs (5.52 million Arabic words and 6.76 million English words). The block set is
generated using a phrase-pair selection algorithm
similar to (Koehn et al., 2003; Al-Onaizan et al.,
2004), which includes some heuristic filtering to
increase phrase translation accuracy. Blocks that
occur only once in the training data are included
as well.
4.1 Practical Implementation Details
The training algorithm in Table 2 is adapted from
Table 1. The training is carried out by running $L = 30$ times over the parallel training data, each time decoding all the $N = 230\,000$ training sentences and generating a single block translation sequence for each training sentence. The top five block sequences $V_5(S_i)$ with the highest BLEU score are computed up-front for all training sentence pairs $S_i$ and are stored separately as described in Section 2. The score-based decoding of the 230 000 training sentence pairs is carried out in parallel on 25 64-Bit Opteron machines. Here, the monotone
decoding is much faster than the decoding with
block swapping: the monotone decoding takes less
than 0.5 hours and the decoding with swapping
takes about an hour. Since the training starts with
only the parallel training data and a block set,
some initial block sequences have to be generated
in order to initialize the global model training: for
each input sentence a simple bag of blocks trans-
lation is generated. For each input interval that is
matched by some block $b$, a single block is added to the bag-of-blocks translation $z_0(S)$. The order
in which the blocks are generated is ignored. For
this block set only block and word identity fea-
tures are generated, i.e. features of type $f_{1000}$ and $f_{1001}$ in Section 2. This step does not require the
use of a decoder. The initial block sequence train-
ing data contains only a single alternative. The
training procedure proceeds by iteratively decod-
ing the training data. After each decoding step, the
resulting translation block sequences are stored on
disc in binary format. A block sequence gener-
ated at decoding step $\ell_1$ is used in all subsequent training steps $\ell_2$, where $\ell_2 > \ell_1$. The block sequence training data after the $\ell$-th decoding step is given as $(V_5(S_i), V_{rel}(S_i))_{i=1}^{N}$, where the size $|V_{rel}(S_i)|$ of the relevant alternative set is $\ell + 1$. Although in order to achieve fast conver-
gence with a theoretical guarantee, we should use
Eq. 6 to update the relevant set, in reality, this
idea is difficult to implement because it requires
a more costly decoding step. Therefore in Table 2,
we adopt an approximation, where the relevant set
is updated by adding the decoder output at each
stage. In this way, we are able to treat the decoding
scheme as a black box.

Table 2: Relevant set method: $L$ = number of decoding iterations, $N$ = number of training sentences.

for each input sentence $S_i$, $i = 1, \ldots, N$
    initialize truth $V_5(S_i)$ and alternative $V_{rel}(S_i) = \{z_0(S_i)\}$
for each decoding iteration $\ell$: $\ell = 1, \ldots, L$
    train $w$ using SGD on training data $(V_5(S_i), V_{rel}(S_i))_{i=1}^{N}$
    for each input sentence $S_i$, $i = 1, \ldots, N$
        select top-scoring sequence $\tilde{z}(S_i)$ and update $V_{rel}(S_i) \leftarrow V_{rel}(S_i) \cup \{\tilde{z}(S_i)\}$

One way to approximate
Eq. 6 is to generate multiple decoding outputs
and pick the most relevant points based on Eq. 6.
Since the $N$-best list generation is computation-
ally costly, only a single block sequence is gener-
ated for each training sentence pair, reducing the
memory requirements for the training algorithm
as well. Although we are not able to rigorously
prove a fast convergence rate for this approximation,
it works well in practice, as Figure 2 shows. Theo-
retically this is because points achieving large val-
ues in Eq. 6 tend to have higher chances to become
the top-ranked decoder output as well. The SGD-
based on-line training algorithm described in Section 3 is carried out after each decoding step to generate the weight vector $w$
for the subsequent
decoding step. Since this training step is carried
out on a single machine, it dominates the overall
computation time. Since each iteration adds a sin-
gle relevant alternative to the set $V_{rel}(S_i)$, com-
putation time increases with the number of train-
ing iterations: the initial model is trained in a few
minutes, while training the model after the 30-th iteration takes up to 5 hours for the most complex
models.
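Putting the pieces together, the overall loop of Table 2 can be sketched as follows; decode_single_best(), sentence_bleu(), sgd_epoch(), and the pre-computed bag-of-blocks initialization are assumed helpers, and all names are ours rather than the original implementation's:

```python
# Sketch of the practical training loop of Table 2 (our own simplification).
# truth[i]    : the 5 highest-BLEU block sequences for sentence i (precomputed)
# relevant[i] : the growing alternative set, initialized with the
#               bag-of-blocks translation z_0(S_i)
def train(sentences, truth, relevant, decode_single_best, sentence_bleu,
          sgd_epoch, init_w, num_iterations=30):
    w = init_w
    for _ in range(num_iterations):
        # step (**): one SGD pass over all (truth, alternative) pairs seen so far
        w = sgd_epoch(w, truth, relevant)
        # step (*) approximation: add the single top-scoring decoder output
        for i, s in enumerate(sentences):
            z = decode_single_best(w, s)
            relevant[i].append((z, sentence_bleu(z, s)))
    return w
```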
Table 3 presents experimental results in terms of
uncased BLEU 6. Two re-ordering restrictions are
tested, i.e. monotone decoding (’MON’), and lo-
cal block re-ordering where neighbor blocks can
be swapped (’SWAP’). The ’SWAP’ re-ordering
uses the same features as the monotone models
plus additional orientation-based and distortion-
6Translation performance in terms of cased BLEU is typ-
ically reduced by about 2%.
Table 3: Translation results in terms of uncased BLEU on the training data (230 000 sentences) and the MT03 test data (670 sentences).

   Re-ordering  Features  train   test
1  'MON'        bleu      0.542   -
2               phrase    0.378   0.256
3               word      0.427   0.341
4               both      0.477   0.359
5  'SWAP'       bleu      0.594   -
6               phrase    0.441   0.295
7               word      0.455   0.359
8               both      0.479   0.363
based features. Different feature sets include
word-based features, phrase-based features, and
the combination of both. For the results with
word-based features, the decoder still generates
phrase-to-phrase translations, but all the scoring
is done on the word level. Line 8 shows a BLEU score of 36.3 for the best performing system, which uses all word-based and phrase-based features 7.
Line 1 and line 5 of Table 3 show the training
data averaged BLEU score obtained by searching
for the highest BLEU scoring block sequence for
each training sentence pair as described in Sec-
tion 2. Allowing local block swapping in this
search procedure yields a much improved BLEU
score of 0.59. The experimental results show
that word-based models significantly outperform
phrase-based models, and that the combination of word-based and phrase-based features performs better than those feature types taken separately. Addi-
tionally, swap-based re-ordering slightly improves
performance over monotone decoding. For all
experiments, the training BLEU score remains
significantly lower than the maximum obtainable
BLEU score shown in line 1 and line 5. In this re-
spect, there is significant room for improvements
in terms of feature functions and alternative set
generation. The word-based models perform surprisingly well: the model in line 7 uses only three feature types, namely Model 1 features like $f_{1001}$ in Section 2, distortion features, and target language $n$-gram features up to $n = 3$. Training speed
varies depending on the feature types used: for
the simplest model shown in line 2 of Table 3, the training takes about 12 hours; for the models using word-based features shown in line 3 and line 7, training takes less than 2 days. Finally, the training for the most complex model in line 8 takes about 4 days.

7With a margin of ±0.014, the differences between the results in line 4, line 7, and line 8 are not statistically significant, but the other result differences are.

Figure 2: BLEU performance on the training set (upper graph, 'SWAP.TRAINING'; averaged BLEU with single reference) and the test set (lower graph, 'SWAP.TEST'; BLEU with four references) as a function of the training iteration $\ell$ for the model corresponding to line 8 in Table 3.
Figure 2 shows the BLEU performance for the
model corresponding to line 8 in Table 3 as a
function of the number of training iterations. By
adding top scoring alternatives in the training al-
gorithm in Table 2, the BLEU performance on the
training data improves from about 0.22 for the initial model to about 0.48 for the best model after 30 iterations. After each training iteration the test
data is decoded as well. Here, the BLEU perfor-
mance improves from 0.08 for the initial model to about 0.36 for the final model (we do not include the test data block sequences in the training). Figure 2 shows a typical learning curve for the experiments in Table 3: the training BLEU score is much higher than the test set BLEU score despite the fact that the test set uses 4 reference translations.
5 Discussion and Future Work
The work in this paper substantially differs from
previous work in SMT based on the noisy chan-
nel approach presented in (Brown et al., 1993).
While error-driven training techniques are com-
monly used to improve the performance of phrase-
based translation systems (Chiang, 2005; Och,
2003), this paper presents a novel block sequence
translation approach to SMT that is similar to
sequential natural language annotation problems
such as part-of-speech tagging or shallow parsing,
both in modeling and parameter training. Unlike
earlier approaches to SMT training, which either
rely heavily on domain knowledge, or can only
handle a small number of features, this approach
treats the decoding process as a black box, and
can optimize tens of millions of parameters automat-
ically, which makes it applicable to other problems
as well. Our formulation is convex, which ensures that we are able to find the global
optimum even for large scale problems. The loss
function in Eq. 4 may not be optimal, and us-
ing different choices may lead to future improve-
ments. Another important direction for perfor-
mance improvement is to design methods that bet-
ter approximate Eq. 6. Although at this stage the
system performance is not yet better than that of previous
approaches, good translation results are achieved
on a standard translation task. While being similar
to (Tillmann and Zhang, 2005), the current proce-
dure is more automated with comparable perfor-
mance. The latter approach requires a decompo-
sition of the decoding scheme into local decision
steps with the inherent difficulty acknowledged in
(Tillmann and Zhang, 2005). Since such a limitation
is not present in the current model, improved re-
sults may be obtained in the future. A perceptron-
like algorithm that handles global features in the
context of re-ranking is also presented in (Shen et
al., 2004).
The computational requirements for the training
algorithm in Table 2 can be significantly reduced.
While the global training approach presented in
this paper is simple, after 15 iterations or so, the
alternatives that are being added to the relevant set
differ very little from each other, slowing down
the training considerably such that the set of possi-
ble block translations $V(S)$ might not be fully ex-
plored. As mentioned in Section 2, the current ap-
proach is still able to handle real-valued features,
e.g. the language model probability. This is im-
portant since the language model can be trained
on a much larger monolingual corpus.
6 Acknowledgment
This work was partially supported by the GALE
project under the DARPA contract No. HR0011-
06-2-0001. The authors would like to thank the
anonymous reviewers for their detailed criticism
on this paper.
References
Yaser Al-Onaizan, Niyu Ge, Young-Suk Lee, Kishore
Papineni, Fei Xia, and Christoph Tillmann. 2004.
IBM Site Report. In NIST 2004 MT Workshop,
Alexandria, VA, June. IBM.
Peter F. Brown, Vincent J. Della Pietra, Stephen
A. Della Pietra, and Robert L. Mercer. 1993. The
Mathematics of Statistical Machine Translation: Pa-
rameter Estimation. CL, 19(2):263–311.
David Chiang. 2005. A hierarchical phrase-based
model for statistical machine translation. In Proc. of ACL 2005, pages 263–270, Ann Arbor, Michigan,
June.
Michael Collins. 2002. Discriminative training meth-
ods for hidden Markov models: Theory and ex-
periments with perceptron algorithms. In Proc.
EMNLP’02, Philadelphia, PA.
A. Ittycheriah and S. Roukos. 2005. A Maximum
Entropy Word Aligner for Arabic-English MT. In
Proc. of HLT-EMNLP 05, pages 89–96, Vancouver,
British Columbia, Canada, October.
Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003.
Statistical phrase-based translation. In HLT-NAACL
2003: Main Proceedings, pages 127–133, Edmon-
ton, Alberta, Canada, May 27 - June 1.
Shankar Kumar and William Byrne. 2005. Lo-
cal phrase reordering models for statistical machine
translation. In Proc. of HLT-EMNLP 05, pages 161–
168, Vancouver, British Columbia, Canada, October.
D. Lewis and J. Catlett. 1994. Heterogeneous un-
certainty sampling for supervised learning. In Pro-
ceedings of the Eleventh International Conference
on Machine Learning, pages 148–156.
Ryan McDonald, Koby Crammer, and Fernando
Pereira. 2005. Online large-margin training of de-
pendency parsers. In Proceedings of ACL’05, pages
91–98, Ann Arbor, Michigan, June.
Franz Josef Och. 2003. Minimum error rate training
in statistical machine translation. In Proceedings of
ACL’03, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of
ACL’02, pages 311–318, Philadelphia, PA, July.
Libin Shen, Anoop Sarkar, and Franz-Josef Och. 2004.
Discriminative Reranking of Machine Translation.
In Proceedings of the Joint HLT and NAACL Confer-
ence (HLT 04), pages 177–184, Boston, MA, May.
Christoph Tillmann and Tong Zhang. 2005. A local-
ized prediction model for statistical machine trans-
lation. In Proceedings of ACL’05, pages 557–564,
Ann Arbor, Michigan, June.
