Proceedings of the 43rd Annual Meeting of the ACL, pages 557–564,
Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
A Localized Prediction Model for Statistical Machine Translation
Christoph Tillmann and Tong Zhang
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598 USA
{ctill,tzhang}@us.ibm.com
Abstract
In this paper, we present a novel training
method for a localized phrase-based predic-
tion model for statistical machine translation
(SMT). The model predicts blocks with orien-
tation to handle local phrase re-ordering. We
use a maximum likelihood criterion to train a
log-linear block bigram model which uses real-
valued features (e.g. a language model score)
as well as binary features based on the block
identities themselves, e.g. block bigram fea-
tures. Our training algorithm can easily handle
millions of features. The best system obtains
an 18.6% improvement over the baseline on a
standard Arabic-English translation task.
1 Introduction
In this paper, we present a block-based model for statis-
tical machine translation. A block is a pair of phrases
which are translations of each other. For example, Fig. 1
shows an Arabic-English translation example that uses 4
blocks. During decoding, we view translation as a block
segmentation process, where the input sentence is seg-
mented from left to right and the target sentence is gener-
ated from bottom to top, one block at a time. A monotone
block sequence is generated except for the possibility to
swap a pair of neighbor blocks. We use an orientation
model similar to the lexicalized block re-ordering model
in (Tillmann, 2004; Och et al., 2004): to generate a block
$b$ with orientation $o$ relative to its predecessor block $b'$.
During decoding, we compute the probability $Pr(b_1^n, o_1^n)$ of a block sequence $b_1^n$ with orientation $o_1^n$ as a product of block bigram probabilities:
\[ Pr(b_1^n, o_1^n) \approx \prod_{i=1}^{n} p(b_i, o_i \mid b_{i-1}, o_{i-1}), \quad (1) \]
Figure 1: An Arabic-English block translation example, where the Arabic words are romanized. The following orientation sequence is generated: $o_1 = N$, $o_2 = L$, $o_3 = N$, $o_4 = R$.
where $b_i$ is a block and $o_i \in \{L(\text{eft}), R(\text{ight}), N(\text{eutral})\}$ is a three-valued orientation component linked to the block $b_i$ (the orientation $o_{i-1}$ of the predecessor block is currently ignored). Here, the block sequence with orientation $(b_1^n, o_1^n)$ is generated under the restriction that the concatenated source phrases of the blocks $b_i$ yield the input sentence. In modeling a block sequence, we emphasize adjacent block neighbors that have Right or Left orientation. Blocks with Neutral orientation are supposed to be less strongly 'linked' to their predecessor block and are handled separately. During decoding, most blocks have right orientation ($o = R$), since the block translations are mostly monotone.
The focus of this paper is to investigate issues in dis-
criminative training of decoder parameters. Instead of di-
rectly minimizing error as in earlier work (Och, 2003),
we decompose the decoding process into a sequence of
local decision steps based on Eq. 1, and then train each
local decision rule using convex optimization techniques.
The advantage of this approach is that it can easily handle a large number of features. Moreover, under this
view, SMT becomes quite similar to sequential natural
language annotation problems such as part-of-speech tag-
ging, phrase chunking, and shallow parsing.
The paper is structured as follows: Section 2 introduces
the concept of block orientation bigrams. Section 3
describes details of the localized log-linear prediction
model used in this paper. Section 4 describes the on-
line training procedure and compares it to the well known
perceptron training algorithm (Collins, 2002). Section 5
shows experimental results on an Arabic-English transla-
tion task. Section 6 presents a final discussion.
2 Block Orientation Bigrams
This section describes a phrase-based model for SMT
similar to the models presented in (Koehn et al., 2003;
Och et al., 1999; Tillmann and Xia, 2003). In our pa-
per, phrase pairs are named blocks and our model is de-
signed to generate block sequences. We also model the
position of blocks relative to each other: this is called
orientation. To define block sequences with orienta-
tion, we define the notion of block orientation bigrams.
The starting point for collecting these bigrams is a block set $\Gamma = \{\, b = (S, T) = (s_1^J, t_1^I) \,\}$. Here, $b$ is a block consisting of a source phrase $S$ and a target phrase $T$. $J$ is the source phrase length and $I$ is the target phrase length. Single source and target words are denoted by $s_j$ and $t_i$ respectively, where $j = 1, \ldots, J$ and $i = 1, \ldots, I$.
We will also use a special single-word block set $\Gamma_1 \subseteq \Gamma$ which contains only blocks for which $J = I = 1$. For the experiments in this paper, the block set is the one used in (Al-Onaizan et al., 2004). Although this is not investigated in the present paper, different block sets may be used for computing the block statistics introduced in this paper, which may affect translation results.
For the block set a90 and a training sentence pair, we
carry out a two-dimensional pattern matching algorithm
to find adjacent matching blocks along with their position
in the coordinate system defined by source and target po-
sitions (see Fig. 2). Here, we do not insist on a consistent
block coverage as one would do during decoding. Among
the matching blocks, two blocks $b'$ and $b$ are adjacent if the target phrases $T$ and $T'$ as well as the source phrases $S$ and $S'$ are adjacent. $b'$ is the predecessor of block $b$ if $b'$ and $b$ are adjacent and $b'$ occurs below $b$. A right adjacent successor block $b$ is said to have right orientation $o = R$.
Figure 2: Block $b'$ is the predecessor of block $b$. The successor block $b$ occurs with either left ($o = L$) or right ($o = R$) orientation. 'left' and 'right' are defined relative to the $x$ axis; 'below' is defined relative to the $y$ axis. For some discussion on global re-ordering see Section 6.
A left adjacent successor block is said to have left orientation $o = L$. There are matching blocks $b$ that have no predecessor; such a block has neutral orientation ($o = N$).
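To make the adjacency and orientation definitions concrete, the following sketch (a simplified illustration, not the authors' implementation; the span-based block representation and helper names are assumptions) assigns an orientation to a matching block given the set of all matching blocks in the coordinate system of Fig. 2:

```python
# Hypothetical sketch: orientation assignment for matching blocks.
# A block is represented by its source span (x1, x2) and target span (y1, y2),
# both inclusive, in the source/target coordinate system of Fig. 2.

def orientation(block, matching_blocks):
    """Return 'R', 'L', or 'N' for `block` relative to a predecessor, if any."""
    x1, x2, y1, y2 = block
    for px1, px2, py1, py2 in matching_blocks:
        below = (py2 == y1 - 1)           # candidate predecessor ends directly below `block`
        right_adjacent = (px2 == x1 - 1)  # `block` continues the source phrase to the right
        left_adjacent = (px1 == x2 + 1)   # `block` lies to the left on the source axis (swap)
        if below and right_adjacent:
            return 'R'
        if below and left_adjacent:
            return 'L'
    return 'N'   # no adjacent predecessor found: neutral orientation

# Example: a swapped block pair (the successor lies to the left on the source axis)
blocks = [(3, 4, 0, 1), (0, 2, 2, 3)]
print(orientation((0, 2, 2, 3), blocks))  # -> 'L'
```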
After matching blocks for a training sentence pair, we look for adjacent block pairs to collect block bigram orientation events $e$ of the type $e = (b', o, b)$. Our model, to be presented in Section 3, is used to predict a future block orientation pair $(b, o)$ given its predecessor block history $b'$. In Fig. 1, the following block orientation bigrams occur: $(\cdot, N, b_1)$, $(b_1, L, b_2)$, $(\cdot, N, b_3)$, $(b_3, R, b_4)$. Collecting orientation bigrams on all parallel sentence pairs, we obtain an orientation bigram list $e_1^N$:
\[ e_1^N \;=\; \big[\, e_1^{n_s} \,\big]_{s=1}^{S} \;=\; \big[\, (b'_i, o_i, b_i)_{i=1}^{n_s} \,\big]_{s=1}^{S} \quad (2) \]
Here, $n_s$ is the number of orientation bigrams in the $s$-th sentence pair. The total number $N$ of orientation bigrams, $N = \sum_{s=1}^{S} n_s$, is about $7.8$ million for our training data consisting of $S = 273\,000$ sentence pairs. The orientation bigram list is used for the parameter training presented in Section 3. Ignoring the bigrams with Neutral orientation ($o = N$) reduces the list defined in Eq. 2 to about $5.0$ million orientation bigrams. The Neutral orientation is handled separately as described in Section 5. Using the reduced orientation bigram list, we collect unigram orientation counts $N_o(b)$: how often a block occurs with a given orientation $o \in \{L, R\}$. $N_L(b) > 0.25 \cdot N_R(b)$ typically holds for blocks $b$ involved in block swapping, and the orientation model $p_o(b)$ is defined as:
\[ p_o(b) \;=\; \frac{N_o(b)}{N_L(b) + N_R(b)}. \]
In order to train a block bigram orientation model as described in Section 3.2, we define a successor set $\delta_s(b')$ for a block $b'$ in the $s$-th training sentence pair:
\[ \delta_s(b') \;=\; \{\, \text{triples of type } (b', L, b) \text{ or of type } (b', R, b) \in e_1^{n_s} \,\} \]
The successor set $\delta(b')$ is defined for each event in the list $e_1^N$. The average size of $\delta(b')$ is $1.5$ successor blocks.
If we were to compute a Viterbi block alignment for a training sentence pair, each block in this block alignment would have at most $1$ successor. Blocks may have several successors here because we do not enforce any kind of consistent coverage during training.
During decoding, we generate a list of block orientation bigrams as described above. A DP-based beam search procedure identical to the one used in (Tillmann, 2004) is used to maximize over all oriented block segmentations $(b_1^n, o_1^n)$. During decoding, orientation bigrams $(b', L, b)$ with left orientation are only generated if $N_L(b) \geq 3$ for the successor block $b$.
3 Localized Block Model and
Discriminative Training
In this section, we describe the components used to compute the block bigram probability $p(b_i, o_i \mid b_{i-1}, o_{i-1})$ in Eq. 1. A block orientation pair $(o', b'; o, b)$ is represented as a feature vector $f(b, o; b', o') \in \mathbb{R}^{d}$. For a model that uses all the components defined below, $d$ is $6$. As feature-vector components, we take the negative logarithm of some block model probabilities. We use the term 'float' feature for these feature-vector components (the model score is stored as a float number). Additionally, we use binary block features. The letters (a)-(f) refer to Table 1:
Unigram Models: we compute (a) the unigram probability $p(b)$ and (b) the orientation probability $p_o(b)$. These probabilities are simple relative frequency estimates based on unigram and unigram orientation counts derived from the data in Eq. 2. For details see (Tillmann, 2004). During decoding, the unigram probability is normalized by the source phrase length.
Two types of trigram language model features: (c) the probability of predicting the first target word in the target clump of $b_i$ given the final two words of the target clump of $b_{i-1}$, and (d) the probability of predicting the rest of the words in the target clump of $b_i$. The language model is trained on a separate corpus.
Lexical Weighting: (e) the lexical weight $p(S \mid T)$ of the block $b = (S, T)$ is computed similarly to (Koehn et al., 2003); details are given in Section 3.4.
Binary features: (f) binary features are defined using an indicator function $f(b, b')$ which is $1$ if the block pair $(b, b')$ occurs more often than a given threshold $N$, e.g. $N = 2$. Here, the orientation $o$ between the blocks is ignored.
\[ f(b, b') \;=\; \begin{cases} 1 & N(b, b') > N \\ 0 & \text{else} \end{cases} \quad (3) \]
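As an illustration of how such a feature vector might be assembled (a hedged sketch with assumed data structures; the feature indexing, score dictionary, and count table are ours, not the paper's implementation), consider:

```python
import math

def feature_vector(b, o, b_prev, o_prev, float_scores, bigram_counts,
                   binary_index, threshold=2):
    """Sparse feature vector f(b, o; b_prev, o_prev) as a dict {index: value}.
    The orientation pair (o, o_prev) is assumed to enter only through the
    precomputed float model scores for this decision."""
    f = {}
    # (a)-(e): negative log-probabilities of the block models ('float' features).
    for k, name in enumerate(['unigram', 'orientation', 'lm_first',
                              'lm_rest', 'lex_weight']):
        f[k] = -math.log(float_scores[name])
    # (f): one binary feature per frequent block bigram; the orientation is ignored.
    idx = binary_index.get((b_prev, b))
    if idx is not None and bigram_counts.get((b_prev, b), 0) > threshold:
        f[5 + idx] = 1.0
    return f
```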
3.1 Global Model
In our linear block model, for a given source sentence $s$, each translation is represented as a sequence of block/orientation pairs $\{b_1^n, o_1^n\}$ consistent with the source. Using features such as those described above, we can parameterize the probability of such a sequence as $Pr(b_1^n, o_1^n \mid w, s)$, where $w$ is a vector of unknown model parameters to be estimated from the training data. We use a log-linear probability model and maximum likelihood training: the parameter $w$ is estimated by maximizing the joint likelihood over all sentences. Denote by $\Delta(s)$ the set of possible block/orientation sequences $\{b_1^n, o_1^n\}$ that are consistent with the source sentence $s$; then a log-linear probability model can be represented as
\[ Pr(b_1^n, o_1^n \mid w, s) \;=\; \frac{\exp\big(w^T f(b_1^n, o_1^n)\big)}{Z(s)}, \quad (4) \]
where $f(b_1^n, o_1^n)$ denotes the feature vector of the corresponding block translation, and the partition function is:
\[ Z(s) \;=\; \sum_{\{b_1'^{\,n'},\, o_1'^{\,n'}\} \in \Delta(s)} \exp\big(w^T f(b_1'^{\,n'}, o_1'^{\,n'})\big). \]
A disadvantage of this approach is that the summation
over $\Delta(s)$ can be rather difficult to compute. Consequently, some sophisticated approximate inference meth-
ods are needed to carry out the computation. A detailed
investigation of the global model will be left to another
study.
3.2 Local Model Restrictions
In the following, we consider a simplification of the direct global model in Eq. 4. As in (Tillmann, 2004), we model the block bigram probability as $p(b_i, o_i \in \{L, R\} \mid b_{i-1}, o_{i-1})$ in Eq. 1. We distinguish the two cases (1) $o_i \in \{L, R\}$, and (2) $o_i = N$. Orientation is modeled only in the context of immediate neighbors for blocks that have left or right orientation. The log-linear model is defined as:
\[ p(b, o \in \{L, R\} \mid b', o'; w, s) \;=\; \frac{\exp\big(w^T f(b, o; b', o')\big)}{Z(b', o'; s)}, \quad (5) \]
where $s$ is the source sentence and $f(b, o; b', o')$ is a locally defined feature vector that depends only on the current and the previous oriented blocks $(b, o)$ and $(b', o')$. The features were described at the beginning of the section. The partition function is given by
\[ Z(b', o'; s) \;=\; \sum_{(b, o) \in \Delta(b', o'; s)} \exp\big(w^T f(b, o; b', o')\big). \quad (6) \]
The set $\Delta(b', o'; s)$ is a restricted set of possible successor oriented blocks that are consistent with the current block position and the source sentence $s$, to be described in the following paragraph. Note that a straightforward normalization over all block orientation pairs in Eq. 5 is not feasible: there are tens of millions of possible successor blocks $b$ (if we do not impose any restriction).
For each block $b = (S, T)$ aligned with a source sentence $s$, we define a source-induced alternative set:
\[ \Gamma(b) \;=\; \{\, \text{all blocks } b'' \in \Gamma \text{ that share an identical source phrase with } b \,\} \]
The set $\Gamma(b)$ contains the block $b$ itself, and the target phrases of blocks in that set may differ. To restrict the number of alternatives further, the elements of $\Gamma(b)$ are sorted according to the unigram count $N(b'')$ and we keep at most the top $9$ blocks for each source interval of $s$. We also use a modified alternative set $\Gamma_1(b)$, where the block $b$ as well as the elements in the set $\Gamma_1(b)$ are single-word blocks. The partition function is computed slightly differently during training and decoding:
Training: for each event $(b', o, b)$ in a sentence pair $s$ in Eq. 2 we compute the successor set $\delta_s(b')$. This defines a set of 'true' block successors. For each true successor $b$, we compute the alternative set $\Gamma(b)$. $\Delta(b', o'; s)$ is the union of the alternative sets for all successors $b$. Here, the orientation $o$ from the true successor $b$ is assigned to each alternative in $\Gamma(b)$. We obtain on average $12.8$ alternatives per training event $(b', o, b)$ in the list $e_1^N$.
Decoding: here, each block $b$ that matches a source interval following $b'$ in the sentence $s$ is a potential successor. We simply set $\Delta(b', o'; s) = \Gamma(b)$. Moreover, setting $Z(b', o'; s) = 0.5$ during decoding does not change performance: the list $\Gamma(b)$ just restricts the possible target translations for a source phrase.
Under this model, the log-probability of a possible translation of a source sentence $s$, as in Eq. 1, can be written as
\[ \log Pr(b_1^n, o_1^n \mid w, s) \;=\; \sum_{i=1}^{n} \log \frac{\exp\big(w^T f(b_i, o_i; b_{i-1}, o_{i-1})\big)}{Z(b_{i-1}, o_{i-1}; s)}. \quad (7) \]
In maximum-likelihood training, we find $w$ by maximizing the sum of the log-likelihood over observed sentences, each of which has the form in Eq. 7. Although the training methodology is similar to the global formulation given in Eq. 4, this localized version is computationally much easier to manage since the summation in the partition function $Z(b_{i-1}, o_{i-1}; s)$ is now over a relatively small set of candidates. This computational advantage is the main reason that we adopt the local model in this paper.
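The following sketch (a simplified illustration under assumed data structures, not the authors' code) shows how the local conditional probability of Eq. 5 and the sentence log-likelihood of Eq. 7 could be computed once the candidate set $\Delta(b', o'; s)$ and the feature vectors are available:

```python
import math

def dot(w, f):
    """Dot product between a dense weight dict and a sparse feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def local_log_prob(w, true_feat, candidate_feats):
    """log p(b, o | b', o'; w, s) of Eq. 5 over a restricted candidate set (Eq. 6).
    `true_feat` is assumed to also appear in `candidate_feats`."""
    scores = [dot(w, f) for f in candidate_feats]
    log_z = math.log(sum(math.exp(s) for s in scores))   # partition function Z
    return dot(w, true_feat) - log_z

def sentence_log_likelihood(w, steps):
    """Eq. 7: sum of local log-probabilities over the block bigram decisions.
    `steps` is a list of (true_feature_vector, list_of_candidate_feature_vectors)."""
    return sum(local_log_prob(w, t, cands) for t, cands in steps)
```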
3.3 Global versus Local Models
Both the global and the localized log-linear models de-
scribed in this section can be considered as maximum-
entropy models, similar to those used in natural language
processing, e.g. maximum-entropy models for POS tag-
ging and shallow parsing. In the parsing context, global
models such as in Eq. 4 are sometimes referred to as conditional random fields (CRFs) (Lafferty et al., 2001).
Although there are some arguments that indicate that
this approach has some advantages over localized models
such as Eq. 5, the potential improvements are relatively
small, at least in NLP applications. For SMT, the differ-
ence can be potentially more significant. This is because
in our current localized model, successor blocks of dif-
ferent sizes are directly compared to each other, which
is intuitively not the best approach (i.e., probabilities
of blocks with identical lengths are more comparable).
This issue is closely related to the phenomenon of multi-
ple counting of events, which means that a source/target
sentence pair can be decomposed into different oriented
blocks in our model. In our current training procedure,
we select one as the truth, while considering the other (possibly also correct) decisions as non-truth alternatives. In global modeling, with appropriate normalization, this
issue becomes less severe. With this limitation in mind,
the localized model proposed here is still an effective
approach, as demonstrated by our experiments. More-
over, it is simple both computationally and conceptually.
Various issues such as the ones described above can be
addressed with more sophisticated modeling techniques,
which we leave to future studies.
3.4 Lexical Weighting
The lexical weight $p(S \mid T)$ of the block $b = (S, T)$ is computed similarly to (Koehn et al., 2003), but the lexical translation probability $p(s \mid t)$ is derived from the block set itself rather than from a word alignment, resulting in a simplified training. The lexical weight is computed as follows:
\[ p(S \mid T) \;=\; \prod_{j=1}^{J} \frac{1}{N_T(s_j, T)} \sum_{i=1}^{I} p(s_j \mid t_i) \]
\[ p(s_j \mid t_i) \;=\; \frac{N(b)}{\sum_{b' \in \Gamma_1(b)} N(b')} \]
Here, the single-word-based translation probability $p(s_j \mid t_i)$ is derived from the block set itself. $b = (s_j, t_i)$ and $b' = (s_j, t_k)$ are single-word blocks, where source and target phrases are of length $1$. $N_T(s_j, T)$ is the number of blocks $b_k = (s_j, t_k)$ with $k \in \{1, \ldots, I\}$ for which $p(s_j \mid t_k) > 0.0$.
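As an illustration (a hedged sketch; the count table and the list representation of phrases are assumed inputs, not the paper's data structures), the lexical weight of a block could be computed from single-word block counts as follows:

```python
def lexical_weight(S, T, unigram_block_count):
    """p(S|T) computed from single-word block counts, following Section 3.4.
    S and T are lists of source and target words; unigram_block_count maps
    (source_word, target_word) pairs to their block count N(b)."""
    def p(s, t):
        # single-word translation probability, normalized over the single-word
        # blocks that share the source word s
        total = sum(c for (s2, _), c in unigram_block_count.items() if s2 == s)
        return unigram_block_count.get((s, t), 0) / total if total else 0.0

    weight = 1.0
    for s in S:
        probs = [p(s, t) for t in T]
        n_t = sum(1 for q in probs if q > 0)       # N_T(s, T)
        weight *= (sum(probs) / n_t) if n_t else 0.0
    return weight
```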
4 Online Training of Maximum-entropy
Model
The local model described in Section 3 leads to the following abstract maximum entropy training formulation:
\[ \hat{w} \;=\; \arg\min_{w} \sum_{i=1}^{m} \log \frac{\sum_{j \in \Delta_i} \exp\big(w^T x_{i,j}\big)}{\exp\big(w^T x_{i,y_i}\big)}. \quad (8) \]
In this formulation, $w$ is the weight vector which we want to compute. The set $\Delta_i$ consists of candidate labels for the $i$-th training instance, with the true label $y_i \in \Delta_i$. The labels here are block identities; $\Delta_i$ corresponds to the alternative set $\Delta(b', o'; s)$, and the 'true' blocks are defined by the successor set $\delta_s(b')$. The vector $x_{i,j}$ is the feature vector of the $i$-th instance, corresponding to label $j \in \Delta_i$. The symbol $x$ is shorthand for the feature vector $f(b, o; b', o')$. This formulation is slightly different from the standard maximum entropy formulation typically encountered in NLP applications, in that we restrict the summation to a subset $\Delta_i$ of all labels.
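To make the abstraction concrete, the following sketch shows how the SMT training events could be mapped onto these abstract instances (illustrative only; the helper functions `alternative_set` and `features` are assumptions standing in for the constructions of Section 3.2, and for simplicity each event gets its own candidate set rather than the union over all true successors):

```python
def build_instances(events, alternative_set, features):
    """Map orientation bigram events (b_prev, o, b) onto abstract maxent instances.
    Returns a list of (candidate_feature_vectors, true_index) pairs, where the
    candidates play the role of Delta_i and the true block that of y_i."""
    instances = []
    for b_prev, o, b in events:
        candidates = []
        true_index = None
        for alt in alternative_set(b):          # blocks sharing b's source phrase
            if alt == b:
                true_index = len(candidates)
            # the orientation o of the true successor is assigned to each alternative
            candidates.append(features(alt, o, b_prev))
        instances.append((candidates, true_index))
    return instances
```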
Intuitively, this method favors a weight vector such that for each $i$, the difference $w^T x_{i,y_i} - w^T x_{i,j}$ is large when $j \neq y_i$. This effect is desirable since it tries to separate the correct classification from the incorrect alternatives. If the problem is completely separable, then it can be shown that the computed linear separator, with appropriate regularization, achieves the largest possible separating margin. The effect is similar to some multi-category generalizations of support vector machines (SVMs). However, Eq. 8 is more suitable for non-separable problems (which is often the case in SMT) since it directly models the conditional probability of the candidate labels.
A related method is the multi-category perceptron, which explicitly finds a weight vector that separates correct labels from the incorrect ones in a mistake-driven fashion (Collins, 2002). The method works by examining one sample at a time, and makes an update $w \leftarrow w + (x_{i,y_i} - x_{i,j})$ when $w^T (x_{i,y_i} - x_{i,j})$ is not positive. To compute the update for a training instance $i$, one usually picks the $j$ such that $w^T (x_{i,y_i} - x_{i,j})$ is smallest. It can be shown that if there exist weight vectors that separate the correct label $y_i$ from incorrect labels $j \in \Delta_i$ for all $j \neq y_i$, then the perceptron method can find such a separator. How-
ever, it is not entirely clear what this method does when
the training data are not completely separable. Moreover,
the standard mistake bound justification does not apply
when we go through the training data more than once, as is typically done in practice. In spite of some issues in its
justification, the perceptron algorithm is still very attrac-
tive due to its simplicity and computational efficiency. It
also works quite well for a number of NLP applications.
In the following, we show that a simple and efficient
online training procedure can also be developed for the
maximum entropy formulation Eq. 8. The proposed up-
date rule is similar to the perceptron method but with a
soft mistake-driven update rule, where the influence of
each feature is weighted by the significance of its mis-
take. The method is essentially a version of the so-
called stochastic gradient descent method, which has
been widely used in complicated stochastic optimization
problems such as neural networks. It was argued re-
cently in (Zhang, 2004) that this method also works well
for standard convex formulations of binary-classification
problems including SVM and logistic regression. Con-
vergence bounds similar to perceptron mistake bounds
can be developed, although unlike perceptron, the theory
justifies the standard practice of going through the train-
ing data more than once. In the non-separable case, the
method solves a regularized version of Eq. 8, which has
the statistical interpretation of estimating the conditional
probability. Consequently, it does not have the potential
issues of the perceptron method which we pointed out
earlier. Due to the nature of online update, just like per-
ceptron, this method is also very simple to implement and
is scalable to large problem size. This is important in the
SMT application because we can have a huge number of
training instances which we are not able to keep in mem-
ory at the same time.
In stochastic gradient descent, we examine one training instance at a time. At the $i$-th instance, we derive the update rule by minimizing the term associated with the instance,
\[ L_i(w) \;=\; \log \frac{\sum_{j \in \Delta_i} \exp\big(w^T x_{i,j}\big)}{\exp\big(w^T x_{i,y_i}\big)}, \]
in Eq. 8. We take a gradient descent step localized to this instance, $w \leftarrow w - \eta_i \frac{\partial}{\partial w} L_i(w)$, where $\eta_i > 0$ is a parameter often referred to as the learning rate. For Eq. 8, the update rule becomes:
\[ w \;\leftarrow\; w + \eta_i \, \frac{\sum_{j \in \Delta_i} \exp\big(w^T x_{i,j}\big)\,\big(x_{i,y_i} - x_{i,j}\big)}{\sum_{j \in \Delta_i} \exp\big(w^T x_{i,j}\big)}. \quad (9) \]
Similar to online algorithms such as the perceptron, we apply this update rule to each training instance in turn (in random order), and may go through the data points repeatedly. Comparing Eq. 9 to the perceptron update, there are two main differences, which we discuss below.
The first difference is the weighting scheme. Instead of putting the update weight on a single (most mistaken) component, as in the perceptron algorithm, we use a soft-weighting scheme, with each component $j$ weighted by a factor $\exp\big(w^T x_{i,j}\big) / \sum_{k \in \Delta_i} \exp\big(w^T x_{i,k}\big)$. A component $j$ with larger $w^T x_{i,j}$ gets more weight. This effect is in principle similar to the perceptron update. The smoothing effect in Eq. 9 is useful for non-separable problems since it does not force an update rule that attempts to separate the data. Each component gets a weight that is proportional to its conditional probability.
The second difference is the introduction of a learning rate parameter $\eta_i$. For the algorithm to converge, one should pick a decreasing learning rate. In practice, however, it is often more convenient to select a fixed $\eta_i = \eta$ for all $i$. This leads to an algorithm that approximately solves a regularized version of Eq. 8. If we go through the data repeatedly, one may also decrease the fixed learning rate by monitoring the progress made each time we go through the data. For practical purposes, a fixed small $\eta$ such as $\eta = 10^{-5}$ is usually sufficient. We typically run forty updates over the training data. Using techniques similar to those of (Zhang, 2004), we can obtain a convergence theorem for our algorithm. Due to space limitations, we do not present the analysis here.
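A compact sketch of this online procedure (an illustration of the update in Eq. 9 under assumed data structures, not the authors' implementation; the learning rate value here follows the reconstruction above and is otherwise an assumption) is:

```python
import math, random

def sgd_train(instances, num_features, eta=1e-5, epochs=40):
    """Online maximum entropy training with the soft, mistake-weighted update of Eq. 9.
    Each instance is (candidate_feature_vectors, true_index); feature vectors are
    dicts {feature_index: value}. `instances` is shuffled in place each epoch."""
    w = [0.0] * num_features
    for _ in range(epochs):                       # forty passes in the paper's setup
        random.shuffle(instances)
        for candidates, y in instances:
            scores = [sum(w[k] * v for k, v in x.items()) for x in candidates]
            z = sum(math.exp(s) for s in scores)
            probs = [math.exp(s) / z for s in scores]   # conditional probabilities p(j)
            # Eq. 9: w <- w + eta * (x_{i,y} - sum_j p(j) * x_{i,j})
            for x, p in zip(candidates, probs):
                for k, v in x.items():
                    w[k] -= eta * p * v
            for k, v in candidates[y].items():
                w[k] += eta * v
    return w
```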
An advantage of this method over standard maximum
entropy training such as GIS (generalized iterative scal-
ing) is that it does not require us to store all the data
in memory at once. Moreover, the convergence analy-
sis can be used to show that if $m$ is large, we can get
a very good approximate solution by going through the
data only once. This desirable property implies that the
method is particularly suitable for large scale problems.
5 Experimental Results
The translation system is tested on an Arabic-to-English
translation task. The training data comes from the UN
news sources. Some punctuation tokenization and some
number classing are carried out on the English and the
Arabic training data. In this paper, we present results for
two test sets: (1) the devtest set uses data provided by
LDC, which consists of a2 a153 a8 a152 sentences with a150 a155 a3a213a3a34a202 Ara-
bic words with a8 reference translations. (2) the blind test
set is the MT03 Arabic-English DARPA evaluation test
set consisting of a7a34a7 a152 sentences with a2a77a7 a150a98a148 a3 Arabic words
with also a8 reference translations. Experimental results
are reported in Table 2: here cased BLEU results are re-
ported on MT03 Arabic-English test set (Papineni et al.,
2002). The word casing is added as post-processing step
using a statistical model (details are omitted here).
In order to speed up the parameter training, we filter the original training data according to the two test sets: for each test set we take all Arabic substrings up to length 12 and filter the parallel training data to include only those training sentence pairs that contain at least one of these phrases. The 'LDC' training data contains about 273 thousand sentence pairs and the 'MT03' training data contains about 230 thousand sentence pairs. Two
block sets are derived for each of the training sets using
a phrase-pair selection algorithm similar to (Koehn et al.,
2003; Tillmann and Xia, 2003). These block sets also
include blocks that occur only once in the training data.
Additionally, some heuristic filtering is used to increase
phrase translation accuracy (Al-Onaizan et al., 2004).
5.1 Likelihood Training Results
We compare model performance with respect to the num-
ber and type of features used as well as with respect
to different re-ordering models. Results for 9 experiments are shown in Table 2, where the feature types are described in Table 1. The first 5 experimental results are obtained by carrying out the likelihood training described in Section 3. Line 1 in Table 2 shows the per-
formance of the baseline block unigram ’MON’ model
which uses two ’float’ features: the unigram probabil-
ity and the boundary-word language model probability.
No block re-ordering is allowed for the baseline model
(a monotone block sequence is generated). The ’SWAP’
model in line 2 uses the same two features, but neighbor blocks can be swapped. No performance increase is
obtained for this model. The ’SWAP & OR’ model uses
an orientation model as described in Section 3. Here, we
obtain a small but significant improvement over the base-
line model. Line 4 shows that including two additional 'float' features, the lexical weighting and the language model probability of predicting the second and subsequent words of the target clump, yields a further significant improvement. Line 5 shows that including binary
features and training their weights on the training data
actually decreases performance. This issue is addressed
in Section 5.2.
The training is carried out as follows: the results in lines 1-4 are obtained by training 'float' weights only. Here, the training is carried out by running only once over 10% of the training data. The model including the binary features is trained on the entire training data. We obtain about 3.37 million features of the type defined in Eq. 3 by setting the threshold $N = 3$. Forty iterations over the training data take about 2 hours on a single Intel machine. Although the online algorithm does not require us to do so, our training procedure keeps the entire training data and the weight vector $w$ in about 2 gigabytes of memory.
For blocks with neutral orientation $o = N$, we train a separate model that does not use the orientation model feature or the binary features. E.g., for the results in line 5 of Table 2, the neutral model would use the features (a), (c), (d), (e), but not (b) and (f). Here, the neutral model is trained on the neutral orientation bigram subsequence that is part of Eq. 2.
5.2 Modified Weight Training
We implemented the following variation of the likeli-
hood training procedure described in Section 3, where
we make use of the ’LDC’ devtest set. First, we train
a model on the 'LDC' training data using 5 float features and the binary features. We use this model to decode
Table 1: List of feature-vector components. For a de-
scription, see Section 3.
Description
(a) Unigram probability
(b) Orientation probability
(c) LM first word probability
(d) LM second and following words probability
(e) Lexical weighting
(f) Binary Block Bigram Features
Table 2: Cased BLEU translation results with confidence intervals on the MT03 test data. The third column summarizes the model variations. The results in lines 8 and 9 are for a cheating experiment: the float weights are trained on the test data itself.
Re-ordering Components BLEU
1 'MON' (a),(c) 32.3 ± 1.5
2 'SWAP' (a),(c) 32.3 ± 1.5
3 'SWAP & OR' (a),(b),(c) 33.9 ± 1.4
4 'SWAP & OR' (a)-(e) 37.7 ± 1.5
5 'SWAP & OR' (a)-(f) 37.2 ± 1.6
6 'SWAP & OR' (a)-(e) (ldc devtest) 37.8 ± 1.5
7 'SWAP & OR' (a)-(f) (ldc devtest) 38.2 ± 1.5
8 'SWAP & OR' (a)-(e) (mt03 test) 39.0 ± 1.5
9 'SWAP & OR' (a)-(f) (mt03 test) 39.3 ± 1.6
the devtest 'LDC' set. During decoding, we generate a 'translation graph' for every input sentence using a procedure similar to (Ueffing et al., 2002): a translation graph is a compact way of representing candidate translations which are close in terms of likelihood. From the translation graph, we obtain the 1000 best translations according to the translation score. Out of this list, we find the block sequence that generated the top BLEU-scoring target translation. Computing the top BLEU-scoring block sequence for all the input sentences, we obtain:
\[ e_1^{N'} \;=\; \big[\, (b'_i, o_i, b_i)_{i=1}^{n'_s} \,\big]_{s=1}^{S'}, \quad (10) \]
where $N' \approx 9400$. Here, $N'$ is the number of blocks needed to decode the entire devtest set. Alternatives for each of the events in $e_1^{N'}$ are generated as described in Section 3.2. The set of alternatives is further restricted by using only those blocks that occur in some translation in the 1000-best list. The 5 float weights are trained on the modified training data in Eq. 10, where the training takes only a few seconds. We then decode the 'MT03' test set using the modified 'float' weights. As shown in lines 4 and 6, there is almost no change in performance between training on the original training data in Eq. 2 or on the modified training data in Eq. 10. Line 8 shows that even when training the float weights on an event set obtained from the test data itself in a cheating experiment, we obtain only a moderate performance improvement from 37.7 to 39.0. For the experimental results in lines 7 and 9, we use the same five float weights as trained for the experiments in lines 6 and 8 and keep them fixed while training the binary feature weights only. Using the binary features leads to only a minor improvement in BLEU from 37.8 to 38.2 in line 7. For this best model, we obtain an 18.6% BLEU improvement over the baseline.
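A sketch of how the modified training events could be selected from the n-best lists (an illustration only; the sentence-level BLEU call uses NLTK's smoothed scorer as a stand-in for whatever metric implementation was actually used, and the data layout is an assumption) is:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def top_bleu_events(nbest, references):
    """Collect orientation bigram events from the top BLEU-scoring hypothesis
    of each n-best list. `nbest[s]` is a list of (translation_tokens,
    block_orientation_events) pairs; `references[s]` is a list of reference
    token lists for sentence s."""
    smooth = SmoothingFunction().method1
    events = []
    for s, hyps in enumerate(nbest):
        best = max(hyps, key=lambda h: sentence_bleu(references[s], h[0],
                                                     smoothing_function=smooth))
        events.extend(best[1])   # the block sequence behind the best-scoring hypothesis
    return events
```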
From our experimental results, we draw the following conclusions: (1) the translation performance is largely dominated by the 'float' features; (2) using the same set of 'float' features, the performance does not change much when training on training, devtest, or even test data. Although we do not obtain a significant improvement from the use of binary features, we currently expect the use of binary features to be a promising approach for the following reasons:
• The current training does not take into account the block interaction at the sentence level. A more accurate approximation of the global model as discussed in Section 3.1 might improve performance.
• As described in Section 3.2 and Section 5.2, for efficiency reasons alternatives are computed from source phrase matches only. During training, more accurate local approximations of the partition function in Eq. 6 can be obtained by looking at block translations in the context of translation sequences. This involves the computationally expensive generation of a translation graph for each training sentence pair. This is future work.
• As mentioned in Section 1, viewing the translation process as a sequence of local decisions makes it similar to other NLP problems such as POS tagging, phrase chunking, and also statistical parsing. This similarity may facilitate the incorporation of these approaches into our translation model.
6 Discussion and Future Work
In this paper we proposed a method for discriminatively
training the parameters of a block SMT decoder. We
discussed two possible approaches: global versus local.
This work focused on the latter, due to its computational
advantages. Some limitations of our approach have also
been pointed out, although our experiments showed that
this simple method can significantly improve the baseline
model.
As far as the log-linear combination of float features
is concerned, similar training procedures have been pro-
posed in (Och, 2003). That paper reports the use of 8 features whose parameters are trained to optimize performance in terms of different evaluation criteria, e.g. BLEU. In contrast, our paper shows that a significant improvement can also be obtained using a likelihood training criterion.
Our modified training procedure is related to the dis-
criminative re-ranking procedure presented in (Shen et
al., 2004). In fact, one may view discriminative rerank-
ing as a simplification of the global model we discussed,
in that it restricts the number of candidate global transla-
tions to make the computation more manageable. How-
ever, the number of possible translations is often expo-
nential in the sentence length, while the number of can-
didates in a typically reranking approach is fixed. Un-
less one employs an elaborated procedure, the candi-
date translations may also be very similar to one another,
and thus do not give a good coverage of representative
translations. Therefore the reranking approach may have
some severe limitations which need to be addressed. For
this reason, we think that a more principled treatment of
global modeling can potentially lead to further perfor-
mance improvements.
For future work, our training technique may be used
to train models that handle global sentence-level reorder-
ings. This might be achieved by introducing orienta-
tion sequences over phrase types that have been used in
(Schafer and Yarowsky, 2003). To incorporate syntac-
tic knowledge into the block-based model, we will exam-
ine the use of additional real-valued or binary features,
e.g. features that look at whether the block phrases cross
syntactic boundaries. This can be done with only minor
modifications to our training method.
Acknowledgment
This work was partially supported by DARPA and mon-
itored by SPAWAR under contract No. N66001-99-2-
8916. The paper has greatly profited from suggestions
by the anonymous reviewers.
References
Yaser Al-Onaizan, Niyu Ge, Young-Suk Lee, Kishore Pa-
pineni, Fei Xia, and Christoph Tillmann. 2004. IBM
Site Report. In NIST 2004 Machine Translation Work-
shop, Alexandria, VA, June.
Michael Collins. 2002. Discriminative training methods
for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proc. EMNLP’02.
Philipp Koehn, Franz-Josef Och, and Daniel Marcu.
2003. Statistical Phrase-Based Translation. In Proc.
of the HLT-NAACL 2003 conference, pages 127–133,
Edmonton, Canada, May.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Con-
ditional random fields: Probabilistic models for seg-
menting and labeling sequence data. In Proceedings
of ICML-01, pages 282–289.
Franz-Josef Och, Christoph Tillmann, and Hermann Ney.
1999. Improved Alignment Models for Statistical Ma-
chine Translation. In Proc. of the Joint Conf. on Em-
pirical Methods in Natural Language Processing and
Very Large Corpora (EMNLP/VLC 99), pages 20–28,
College Park, MD, June.
Och et al. 2004. A Smorgasbord of Features for Statis-
tical Machine Translation. In Proceedings of the Joint
HLT and NAACL Conference (HLT 04), pages 161–
168, Boston, MA, May.
Franz-Josef Och. 2003. Minimum Error Rate Train-
ing in Statistical Machine Translation. In Proc. of
the 41st Annual Conf. of the Association for Computa-
tional Linguistics (ACL 03), pages 160–167, Sapporo,
Japan, July.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. BLEU: a Method for Automatic
Evaluation of machine translation. In Proc. of the
40th Annual Conf. of the Association for Computa-
tional Linguistics (ACL 02), pages 311–318, Philadel-
phia, PA, July.
Charles Schafer and David Yarowsky. 2003. Statistical
Machine Translation Using Coercive Two-Level Syn-
tactic Translation. In Proc. of the Conf. on Empiri-
cal Methods in Natural Language Processing (EMNLP
03), pages 9–16, Sapporo, Japan, July.
Libin Shen, Anoop Sarkar, and Franz-Josef Och. 2004.
Discriminative Reranking of Machine Translation. In
Proceedings of the Joint HLT and NAACL Conference
(HLT 04), pages 177–184, Boston, MA, May.
Christoph Tillmann and Fei Xia. 2003. A Phrase-based
Unigram Model for Statistical Machine Translation. In
Companion Vol. of the Joint HLT and NAACL Confer-
ence (HLT 03), pages 106–108, Edmonton, Canada,
June.
Christoph Tillmann. 2004. A Unigram Orientation
Model for Statistical Machine Translation. In Com-
panion Vol. of the Joint HLT and NAACL Conference
(HLT 04), pages 101–104, Boston, MA, May.
Nicola Ueffing, Franz-Josef Och, and Hermann Ney.
2002. Generation of Word Graphs in Statistical Ma-
chine Translation. In Proc. of the Conf. on Empiri-
cal Methods in Natural Language Processing (EMNLP
02), pages 156–163, Philadelphia, PA, July.
Tong Zhang. 2004. Solving large scale linear prediction
problems using stochastic gradient descent algorithms.
In ICML 04, pages 919–926.
