Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 161–168, Vancouver, October 2005. c©2005 Association for Computational Linguistics
Local Phrase Reordering Models for Statistical Machine Translation
Shankar Kumar, William Byrne∗
Center for Language and Speech Processing, Johns Hopkins University,
3400 North Charles Street, Baltimore, MD 21218, U.S.A.
Machine Intelligence Lab, Cambridge University Engineering Department,
Trumpington Street, Cambridge CB2 1PZ, U.K.
skumar@jhu.edu , wjb31@cam.ac.uk
Abstract
We describe stochastic models of local
phrase movement that can be incorpo-
rated into a Statistical Machine Transla-
tion (SMT) system. These models pro-
vide properly formulated, non-deficient,
probability distributions over reordered
phrase sequences. They are imple-
mented by Weighted Finite State Trans-
ducers. We describe EM-style parameter
re-estimation procedures based on phrase
alignment under the complete translation
model incorporating reordering. Our ex-
periments show that the reordering model
yields substantial improvements in trans-
lation performance on Arabic-to-English
and Chinese-to-English MT tasks. We
also show that the procedure scales as the
bitext size is increased.
1 Introduction
Word and Phrase Reordering is a crucial component
of Statistical Machine Translation (SMT) systems.
However allowing reordering in translation is com-
putationally expensive and in some cases even prov-
ably NP-complete (Knight, 1999). Therefore any
translation scheme that incorporates reordering must
necessarily balance model complexity against the
ability to realize the model without approximation.
In this paper our goal is to formulate models of lo-
cal phrase reordering in such a way that they can be
embedded inside a generative phrase-based model
∗ This work was supported by an ONR MURI Grant
N00014-01-1-0685.
of translation (Kumar et al., 2005). Although this
model of reordering is somewhat limited and can-
not capture all possible phrase movement, it forms
a proper parameterized probability distribution over
reorderings of phrase sequences. We show that with
this model it is possible to perform Maximum A
Posteriori (MAP) decoding (with pruning) and Ex-
pectation Maximization (EM) style re-estimation of
model parameters over large bitext collections.
We now discuss prior work on word and phrase
reordering in translation. We focus on SMT systems
that do not require phrases to form syntactic con-
stituents.
The IBM translation models (Brown et al., 1993)
describe word reordering via a distortion model de-
fined over word positions within sentence pairs. The
Alignment Template Model (Och et al., 1999) uses
phrases rather than words as the basis for transla-
tion, and defines movement at the level of phrases.
Phrase reordering is modeled as a first order Markov
process with a single parameter that controls the de-
gree of movement.
Our current work is inspired by the block
(phrase-pair) orientation model introduced by Till-
mann (2004) in which reordering allows neighbor-
ing blocks to swap. This is described as a sequence
of orientations (left, right, neutral) relative to the
monotone block order. Model parameters are block-
specific and estimated over word aligned trained bi-
text using simple heuristics.
Other researchers (Vogel, 2003; Zens and Ney,
2003; Zens et al., 2004) have reported performance
gains in translation by allowing deviations from
monotone word and phrase order. In these cases,
161
0
c
4
c
5
c
0
d
1
d
1
v
2
v
3
v
4
v
5
v
6
v
7
v
1
f
2
f
3
f
4
f
5
f
6
f
7
f
8
f
9
f
2
d
3
d
4
d
5
d
2
c
3
c
1
c
x
1 x
2
x
3
x
4
x
5
1
e
5
e
7
e
2
e
3
e
4
e
6
e
9
e
8
e
u
1
u
2
u
3
u
4
u
5
y
1
y
5
y
4
y
3
y
2
doivent de_25_%exportationsgrains fléchir
exportationsgrains de_25_%doiventfléchir
1 
les exportationsde 
les  exportations  de  grains  doivent  fléchir de  25  %
grainsdoiventfléchirde_25_%
1
.
exportations doiventgrains fléchirde_25_%
grainexportsare_projected_toby_25_%
grain  exports  are  projected  to  fall  by  25  %
Sentence
fall
Source Language
Target Language Sentence
Figure 1: TTM generative translation process; here,
I = 9,K = 5,R = 7,J = 9.
reordering is not governed by an explicit probabilis-
tic model over reordered phrases; a language model
is employed to select the translation hypothesis. We
also note the prior work of Wu (1996), closely re-
lated to Tillmann’s model.
2 The WFST Reordering Model
The Translation Template Model (TTM) is a genera-
tive model of phrase-based translation (Brown et al.,
1993). Bitext is described via a stochastic process
that generates source (English) sentences and trans-
forms them into target (French) sentences (Fig 1 and
Eqn 1).
P(fJ1 ,vR1 ,dK0 ,cK0 ,yK1 ,xK1 ,uK1 ,K,eI1) =
P(eI1)·
Source Language Model G
P(uK1 ,K|eI1)·
Source Phrase Segmentation W
P(xK1 |uK1 ,K,eI1)·
Phrase Translation and Reordering R
P(vR1 ,dK0 ,cK0 ,yK1 |xK1 ,uK1 ,K,eI1)·
Target Phrase Insertion Φ
P(fJ1 |vR1 ,dK0 ,cK0 ,yK1 ,xK1 ,uK1 ,K,eI1)
Target Phrase Segmentation Ω
(1)
The TTM relies on a Phrase-Pair Inventory (PPI)
consisting of target language phrases and their
source language translations. Translation is mod-
eled via component distributions realized as WFSTs
(Fig 1 and Eqn 1) : Source Language Model (G),
Source Phrase Segmentation (W), Phrase Transla-
tion and Reordering (R), Target Phrase Insertion
(Φ), and Target Phrase Segmentation (Ω) (Kumar et
al., 2005).
TTM Reordering Previously, the TTM was for-
mulated with reordering prior to translation; here,
we perform reordering of phrase sequences follow-
ing translation. Reordering prior to translation was
found to be memory intensive and unwieldy (Kumar
et al., 2005). In contrast, we will show that the cur-
rent model can be used for both phrase alignment
and translation.
2.1 The Phrase Reordering Model
We now describe two WFSTs that allow local re-
ordering within phrase sequences. The simplest al-
lows swapping of adjacent phrases. The second al-
lows phrase movement within a three phrase win-
dow. Our formulation ensures that the overall model
provides a proper parameterized probability distri-
bution over reordered phrase sequences; we empha-
size that the resulting distribution is not degenerate.
Phrase reordering (Fig 2) takes as its input a
French phrase sequence in English phrase order
x1,x2,...,xK. This is then reordered into French
phrase order y1,y2,...,yK . Note that words within
phrases are not affected.
We make the following conditional independence
assumption:
P(yK1 |xK1 ,uK1 ,K,eI1) = P(yK1 |xK1 ,uK1 ). (2)
Given an input phrase sequence xK1 we now as-
sociate a unique jump sequence bK1 with each per-
missible output phrase sequence yK1 . The jump bk
measures the displacement of the kth phrase xk, i.e.
xk → yk+bk,k ∈ {1,2,...,K}. (3)
The jump sequence bK1 is constructed such that yK1
is a permutation of xK1 . This is enforced by con-
structing all models so that summationtextKk=1 bk = 0.
We now redefine the model in terms of the jump
sequence
P(yK1 |xK1 ,uK1 ) (4)
=
braceleftBigg
P(bK1 |xK1 ,uK1 ) yk+bk = xk ∀k
0 otherwise,
162
x2 x3 x4 x5x1
y2 y3 y4 y5y1
3b = 01b = +12b = −1 4b = 0 5b = 0
doivent de_25_%exportations fléchir
exportationsgrains de_25_%doiventfléchir
grains
Figure 2: Phrase reordering and jump sequence.-
where yK1 is determined by xK1 and bK1 .
Each jump bk depends on the phrase-pair (xk,uk)
and preceding jumps bk−11
P(bK1 |xK1 ,uK1 ) =
Kproductdisplay
k=1
P(bk|xk,uk,φk−1), (5)
where φk−1 is an equivalence classification (state)
of the jump sequence bk−11 .
The jump sequence bK1 can be described by a
deterministic finite state machine. φ(bk−11 ) is the
state arrived at by bk−11 ; we will use φk−1 to denote
φ(bk−11 ).
We will investigate phrase reordering by restrict-
ing the maximum allowable jump to 1 phrase and
to 2 phrases; we will refer to these reordering
models as MJ-1 and MJ-2. In the first case,
bk ∈ {0,+1,−1} while in the second case, bk ∈
{0,+1,−1,+2,−2}.
2.2 Reordering WFST for MJ-1
We first present the Finite State Machine of the
phrase reordering process (Fig 3) which has two
equivalence classes (FSM states) for any given his-
tory bk−11 ; φ(bk−11 ) ∈ {1,2}. A jump of +1 has to
be followed by a jump of −1, and 1 is the start and
end state; this ensures summationtextKk=1 bk = 0.
1 b=+1
b=−1
b=0 
2
Figure 3: Phrase reordering process for MJ-1.
Under this restriction, the probability of the jump
bk (Eqn 5) can be simplified as
P(bk|xk,uk,φ(bk−11 )) = (6)


β1(xk,uk) bk = +1,φk−1 = 1
1 −β1(xk,uk) bk = 0,φk−1 = 1
1 bk = −1,φk−1 = 2.
There is a single parameter jump probability
β1(x,u) = P(b = +1|x,u) associated with each
phrase-pair (x,u) in the phrase-pair inventory. This
is the probability that the phrase-pair (x,u) appears
out of order in the transformed phrase sequence.
We now describe the MJ-1 WFST. In the presen-
tation, we use upper-case letters to denote the En-
glish phrases (uk) and lower-case letters to denote
the French phrases (xk and yk).
The PPI for this example is given in Table 1.
English French Parameters
u x P(x|u) β1(x,u)
A a 0.5 0.2
A d 0.5 0.2
B b 1.0 0.4
C c 1.0 0.3
D d 1.0 0.8
Table 1: Example phrase-pair inventory with trans-
lation and reordering probabilities.
The input to the WFST (Fig 4) is a lattice of
French phrase sequences derived from the French
sentence to be translated. The outputs are the cor-
responding English phrase sequences. Note that the
reordering is performed on the English side.
The WFST is constructed by adding a self-loop
for each French phrase in the input lattice, and
a 2-arc path for every pair of adjacent French
phrases in the lattice. The WFST incorporates the
translation model P(x|u) and the reordering model
P(b|x,u). The score on a self-loop with labels
(u,x) is P(x|u) × (1 − β1(x,u)); on a 2-arc path
with labels (u1,x1) and (u2,x2), the score on the
1st arc is P(x2|u1) ×β1(x2,u1) and on the 2nd arc
is P(x1|u2).
In this example, the input to this transducer is a
single French phrase sequence V : a,b,c. We per-
form the WFST composition R◦V , project the result
on the input labels, and remove the epsilons to form
the acceptor (R◦V )1 which contains the six English
phrase sequences (Fig 4).
Translation Given a French sentence, a lattice of
translations is obtained using the weighted finite
state composition: T = G ◦ W ◦ R ◦ Φ ◦ Ω ◦ T.
The most-likely translation is obtained as the path
with the highest probability in T .
Alignment Given a sentence-pair (E,F), a lattice
of phrase alignments is obtained by the finite state
composition: B = S ◦ W ◦ R ◦ Φ ◦ Ω ◦ T, where
163
A : b / 0.1
A B D 0.4 x 0.6 x 0.2 = 0.480B A D 0.4 x 0.5 x 0.2 = 0.040
A D B 0.4 x 0.8 x 0.4 = 0.128A A B 0.4 x 0.1 x 0.4 = 0.016
A B A 0.4 x 0.6 x 0.4 = 0.096B A A 0.4 x 0.5 x 0.4 = 0.080
VR 1( )
A : b / 0.5
R
V a b d
B : b / 0.6D : d / 0.2A : d / 0.4
A : a / 0.4
B : a / 0.4
B : d / 0.4
D : b / 0.8
Figure 4: WFST for the MJ-1 model.
S is an acceptor for the English sentence E, and
T is an acceptor for the French sentence F. The
Viterbi alignment is found as the path with the high-
est probability in B. The WFST composition gives
the word-to-word alignments between the sentences.
However, to obtain the phrase alignments, we need
to construct additional FSTs not described here.
2.3 Reordering WFST for MJ-2
MJ-2 reordering restricts the maximum allowable
jump to 2 phrases and also insists that the reorder-
ing take place within a window of 3 phrases. This
latter condition implies that for an input sequence
{a,b,c,d}, we disallow the three output sequences:
{b,d,a,c; c,a,d,b; c,d,a,b;}. In the MJ-2 finite
state machine, a given history bk−11 can lead to one
of the six states in Fig 5.
b=0
1
23
45
6
b=−1
b=+1b=−1
b=+2b=0
b=−2
b=−1
b=+1 b=−2
Figure 5: Phrase reordering process for MJ-2.
The jump probability of Eqn 5 becomes
P(bk|xk,uk,φk−1) =


β1(xk,uk) bk = 1,φk−1 = 1
β2(xk,uk) bk = 2,φk−1 = 1braceleftBigg
1 −β1(xk,uk)
−β2(xk,uk) bk = 0,φk−1 = 1
(7)
braceleftBigg
β1(xk,uk) bk = 1,φk−1 = 2
1 −β1(xk,uk) bk = −1,φk−1 = 2 (8)
braceleftBigg
0.5 bk = 0,φk−1 = 3
0.5 bk = −1,φk−1 = 3. (9)
braceleftBig
1 bk = −2,φk−1 = 4 (10)
braceleftBig
1 bk = −2,φk−1 = 5 (11)
braceleftBig
1 bk = −1,φk−1 = 6 (12)
We note that the distributions (Eqns 7 and 8) are
based on two parameters β1(x,u) and β2(x,u) for
each phrase-pair (x,u).
Suppose the input is a phrase sequence a,b,c, the
MJ-2 model (Fig 5) allows 6 possible reorderings:
a,b,c;a,c,b;b,a,c;b,c,a;c,a,b;c,b,a. The distri-
bution Eqn 9 ensures that the sequences b,c,a and
c,b,a are assigned equal probability. The distribu-
tions in Eqns 10-12 ensure that the maximum jump
is 2 phrases and the reordering happens within a
window of 3 phrases. By insisting that the pro-
cess start and end at state 1 (Fig 5), we ensure that
the model is not deficient. A WFST implementing
the MJ-2 model can be easily constructed for both
phrase alignment and translation, following the con-
struction described for the MJ-1 model.
3 Estimation of the Reordering Models
The Translation Template Model relies on an in-
ventory of target language phrases and their source
language translations. Our goal is to estimate the
reordering model parameters P(b|x,u) for each
phrase-pair (x,u) in this inventory. However, when
translating a given test set, only a subset of the
phrase-pairs is needed. Although there may be an
advantage in estimating the model parameters under
an inventory that covers all the training bitext, we fix
the phrase-pair inventory to cover only the phrases
on the test set. Estimation of the reordering model
parameters over the training bitext is then performed
under this test-set specific inventory.
164
We employ the EM algorithm to obtain Maximum
Likelihood (ML) estimates of the reordering model
parameters. Applying EM to the MJ-1 reordering
model gives the following ML parameter estimates
for each phrase-pair (u,x).
ˆβ1(x,u) = Cx,u(0,+1)
Cx,u(0,+1) + Cx,u(0,0). (13)
Cx,u(φ,b) is defined for φ = 1,2 and b =
−1,0,+1. Any permissible phrase alignment of a
sentence pair corresponds to a bK1 sequence, which
in turn specifies a φK1 sequence. Cx,u(φ,b) is the
expected number of times the phrase-pair x,u is
aligned with a jump of b phrases when the jump his-
tory is φ. We do not use full EM but a Viterbi train-
ing procedure that obtains the counts for the best
(Viterbi) alignments. If a phrase-pair (x,u) is never
seen in the Viterbi alignments, we back-off to a flat
parameter β1(x,u) = 0.05.
The ML parameter estimates for the MJ-2 model
are given in Table 2, with Cx,u(φ,b) defined sim-
ilarly. In our training scenario, we use WFST op-
erations to obtain Viterbi phrase alignments of the
training bitext where the initial reordering model
parameters (β0(x,u)) are set to a uniform value of
0.05. The counts Cx,u(s,b) are then obtained over
the phrase alignments. Finally the ML estimates of
the parameters are computed using Eqn 13 (MJ-1) or
Eqn 14 (MJ-2). We will refer to the Viterbi trained
models as MJ-1 VT and MJ-2 VT. Table 3 shows the
MJ-1 VT parameters for some example phrase-pairs
in the Arabic-English (A-E) task.
u x β1(x,u)
which is the closest Aqrb 1.0
international trade tjArp EAlmyp 0.8
the foreign ministry wzArp xArjyp 0.6
arab league jAmEp dwl Erbyp 0.4
Table 3: MJ-1 parameters for A-E phrase-pairs.
To validate alignment under a PPI, we mea-
sure performance of the TTM word alignments
on French-English (500 sent-pairs) and Chinese-
English (124 sent-pairs) (Table 4). As desired, the
Alignment Recall (AR) and Alignment Error Rate
(AER) improve modestly while Alignment Preci-
sion (AP) remains constant. This suggests that the
models allow more words to be aligned and thus im-
prove the recall; MJ-2 gives a further improvement
in AR and AER relative to MJ-1. Alignment preci-
Reordering Metrics (%)
Frn-Eng Chn-Eng
AP AR AER AP AR AER
None 94.2 84.8 10.0 85.1 47.1 39.3
MJ-1 VT 94.1 86.8 9.1 85.3 49.4 37.5
MJ-2 VT 93.9 87.4 8.9 85.3 50.9 36.3
Table 4: Alignment Performance with Reordering.
sion depends on the quality of the word alignments
within the phrase-pairs and does not change much
by allowing phrase reordering. This experiment val-
idates the estimation procedure based on the phrase
alignments; however, we do not advocate the use of
TTM as an alternate word alignment technique.
4 Translation Experiments
We perform our translation experiments on the large
data track of the NIST Arabic-to-English (A-E) and
Chinese-to-English (C-E) MT tasks; we report re-
sults on the NIST 2002, 2003, and 2004 evaluation
test sets 1.
4.1 Exploratory Experiments
In these experiments the training data is restricted to
FBIS bitext in C-E and the news bitexts in A-E. The
bitext consists of chunk pairs aligned at sentence
and sub-sentence level (Deng et al., 2004). In A-E,
the training bitext consists of 3.8M English words,
3.2M Arabic words and 137K chunk pairs. In C-E,
the training bitext consists of 11.7M English words,
8.9M Chinese words and 674K chunk pairs.
Our Chinese text processing consists of word seg-
mentation (using the LDC segmenter) followed by
grouping of numbers. For Arabic our text pro-
cessing consisted of a modified Buckwalter analysis
(LDC2002L49) followed by post processing to sep-
arate conjunctions, prepostions and pronouns, and
Al-/w- deletion. The English text is processed us-
ing a simple tokenizer based on the text processing
utility available in the the NIST MT-eval toolkit.
The Language Model (LM) training data consists
of approximately 400M words of English text de-
rived from Xinhua and AFP (English Gigaword), the
English side of FBIS, the UN and A-E News texts,
and the online archives of The People’s Daily.
Table 5 gives the performance of the MJ-1 and
MJ-2 reordering models when translation is per-
formed using a 4-gram LM. We report performance
on the 02, 03, 04 test sets and the combined test set
1http://www.nist.gov/speech/tests/mt/
165
ˆβ1(x,u) = Cx,u(1,+1) + Cx,u(2,+1)
Cx,u(1,+1) + Cx,u(1,0) + Cx,u(1,+2) + Cx,u(2,+1) + Cx,u(2,−1)
ˆβ2(x,u) = (Cx,u(1,0) + Cx,u(2,−1) + Cx,u(1,+2))Cx,u(1,+2)
(Cx,u(1,+1) + Cx,u(1,0) + Cx,u(1,+2) + Cx,u(2,+1) + Cx,u(2,−1))(Cx,u(1,+2) + Cx,u(1,0))
Table 2: ML parameter estimates for MJ-2 model.
Reordering BLEU (%)
Arabic-English Chinese-English
02 03 04 ALL 02 03 04 ALL
None 37.5 40.3 36.8 37.8 ± 0.6 24.2 23.7 26.0 25.0 ± 0.5
MJ-1 flat 40.4 43.9 39.4 40.7 ± 0.6 25.7 24.5 27.4 26.2 ± 0.5
MJ-1 VT 41.3 44.8 40.3 41.6 ± 0.6 25.8 24.5 27.8 26.5 ± 0.5
MJ-2 flat 41.0 44.4 39.7 41.1 ± 0.6 26.4 24.9 27.7 26.7 ± 0.5
MJ-2 VT 41.7 45.3 40.6 42.0 ± 0.6 26.5 24.9 27.9 26.8 ± 0.5
Table 5: Performance of MJ-1 and MJ-2 reordering models with a 4-gram LM.
(ALL=02+03+04). For the combined set (ALL), we
also show the 95% BLEU confidence interval com-
puted using bootstrap resampling (Och, 2003).
Row 1 gives the performance when no reorder-
ing model is used. The next two rows show the in-
fluence of the MJ-1 reordering model; in row 2, a
flat probability of β1(x,u) = 0.05 is used for all
phrase-pairs; in row 3, a reordering probability is
estimated for each phrase-pair using Viterbi Train-
ing (Eqn 13). The last two rows show the effect of
the MJ-2 reordering model; row 4 uses flat proba-
bilities (β1(x,u) = 0.05,β2(x,u) = 0.01) for all
phrase-pairs; row 5 applies reordering probabilities
estimating with Viterbi Training for each phrase-pair
(Table 2).
On both language-pairs, we observe that reorder-
ing yields significant improvements. The gains from
phrase reordering are much higher on A-E relative
to C-E; this could be related to the fact that the word
order differences between English and Arabic are
much higher than the differences between English
and Chinese. MJ-1 VT outperforms flat MJ-1 show-
ing that there is value in estimating the reordering
parameters from bitext. Finally, the MJ-2 VT model
performs better than the flat MJ-2 model, but only
marginally better than the MJ-1 VT model. There-
fore estimation does improve the MJ-2 model but
allowing reordering beyond a window of 1 phrase is
not useful when translating either Arabic or Chinese
into English in this framework.
The flat MJ-1 model outperforms the no-
reordering case and the flat MJ-2 model is better
than the flat MJ-1 model; we hypothesize that phrase
reordering increases search space of translations that
allows the language model to select a higher qual-
ity hypothesis. This suggests that these models of
phrase reordering actually require strong language
models to be effective. We now investigate the inter-
action between language models and reordering.
Our goal here is to measure translation perfor-
mance of reordering models over variable span n-
gram LMs (Table 6). We observe that both MJ-1
and MJ-2 models yield higher improvements under
higher order LMs: e.g. on A-E, gains under 3g
(3.6 BLEU points on MJ-1, 0.2 points on MJ-2) are
higher than the gains with 2g (2.4 BLEU points on
MJ-1, 0.1 points on MJ-2).
Reordering BLEU (%)
A-E C-E
2g 3g 4g 2g 3g 4g
None 21.0 36.8 37.8 16.1 24.8 25.0
MJ-1 VT 23.4 40.4 41.6 16.2 25.9 26.5
MJ-2 VT 23.5 40.6 42.0 16.0 26.1 26.8
Table 6: Reordering with variable span n-gram LMs
on Eval02+03+04 set.
We now measure performance of the reorder-
ing models across the three test set genres used in
the NIST 2004 evaluation: news, editorials, and
speeches. On A-E, MJ-1 and MJ-2 yield larger im-
provements on News relative to the other genres;
on C-E, the gains are larger on Speeches and Ed-
itorials relative to News. We hypothesize that the
Phrase-Pair Inventory, reordering models and lan-
guage models could all have been biased away from
the test set due to the training data. There may also
be less movement across these other genres.
166
Reordering BLEU (%)
A-E C-E
News Eds Sphs News Eds Sphs
None 41.1 30.8 33.3 23.6 25.9 30.8
MJ-1 VT 45.6 32.6 35.7 24.8 27.8 33.3
MJ-2 VT 46.2 32.7 35.5 24.8 27.8 33.7
Table 7: Performance across Eval 04 test genres.
BLEU (%)
Arabic-English Chinese-English
Reordering 02 03 04n 02 03 04n
None 40.2 42.3 43.3 28.9 27.4 27.3
MJ-1 VT 43.1 45.0 45.6 30.2 28.2 28.9
MET-Basic 44.8 47.2 48.2 31.3 30.3 30.3
MET-IBM1 45.2 48.2 49.7 31.8 30.7 31.0
Table 8: Translation Performance on Large Bitexts.
4.2 Scaling to Large Bitext Training Sets
We here describe the integration of the phrase re-
ordering model in an MT system trained on large
bitexts. The text processing and language mod-
els have been described in § 4.1. Alignment Mod-
els are trained on all available bitext (7.6M chunk
pairs/207.4M English words/175.7M Chinese words
on C-E and 5.1M chunk pairs/132.6M English
words/123.0M Arabic words on A-E), and word
alignments are obtained over the bitext. Phrase-pairs
are then extracted from the word alignments (Koehn
et al., 2003). MJ-1 model parameters are estimated
over all bitext on A-E and over the non-UN bitext
on C-E. Finally we use Minimum Error Training
(MET) (Och, 2003) to train log-linear scaling fac-
tors that are applied to the WFSTs in Equation 1.
04news (04n) is used as the MET training set.
Table 8 reports the performance of the system.
Row 1 gives the performance without phrase re-
ordering and Row 2 shows the effect of the MJ-1
VT model. The MJ-1 VT model is used in an initial
decoding pass with the four-gram LM to generate
translation lattices. These lattices are then rescored
under parameters obtained using MET (MET-basic),
and 1000-best lists are generated. The 1000-best
lists are augmented with IBM Model-1 (Brown et
al., 1993) scores and then rescored with a second set
of MET parameters. Rows 3 and 4 show the perfor-
mance of the MET-basic and MET-IBM1 models.
We observe that the maximum likelihood phrase
reordering model (MJ-1 VT) yields significantly im-
proved translation performance relative to the mono-
tone phrase order translation baseline. This confirms
the translation performance improvements found
over smaller training bitexts.
We also find additional gains by applying MET to
optimize the scaling parameters that are applied to
the WFST component distributions within the TTM
(Equation 1). In this procedure, the scale factor ap-
plied to the MJ-1 VT Phrase Translation and Re-
ordering component is estimated along with scale
factors applied to the other model components; in
other words, the ML-estimated phrase reordering
model itself is not affected by MET, but the likeli-
hood that it assigns to a phrase sequence is scaled
by a single, discriminatively optimized weight. The
improvements from MET (see rows MET-Basic and
MET- IBM1) demonstrate that the MJ-1 VT reorder-
ing models can be incorporated within a discrimi-
native optimized translation system incorporating a
variety of models and estimation procedures.
5 Discussion
In this paper we have described local phrase reorder-
ing models developed for use in statistical machine
translation. The models are carefully formulated
so that they can be implemented as WFSTs, and
we show how the models can be incorporated into
the Translation Template Model to perform phrase
alignment and translation using standard WFST op-
erations. Previous approaches to WFST-based re-
ordering (Knight and Al-Onaizan, 1998; Kumar
and Byrne, 2003; Tsukada and Nagata, 2004) con-
structed permutation acceptors whose state spaces
grow exponentially with the length of the sentence to
be translated. As a result, these acceptors have to be
pruned heavily for use in translation. In contrast, our
models of local phrase movement do not grow ex-
plosively and do not require any pruning or approx-
imation in their construction. In other related work,
Bangalore and Ricardi (2001) have trained WF-
STs for modeling reordering within translation; their
WFST parses word sequences into trees containing
reordering information, which are then checked for
well-formed brackets. Unlike this approach, our
model formulation does not use a tree representation
and also ensures that the output sequences are valid
permutations of input phrase sequences; we empha-
size again that the probability distribution induced
over reordered phrase sequences is not degenerate.
Our reordering models do resemble those of (Till-
mann, 2004; Tillmann and Zhang, 2005) in that we
167
treat the reordering as a sequence of jumps relative
to the original phrase sequence, and that the likeli-
hood of the reordering is assigned through phrase-
pair specific parameterized models. We note that
our implementation allows phrase reordering be-
yond simply a 1-phrase window, as was done by Till-
mann. More importantly, our model implements a
generative model of phrase reordering which can be
incorporated directly into a generative model of the
overall translation process. This allows us to per-
form ‘embedded’ EM-style parameter estimation,
in which the parameters of the phrase reordering
model are estimated using statistics gathered under
the complete model that will actually be used in
translation. We believe that this estimation of model
parameters directly from phrase alignments obtained
under the phrase translation model is a novel contri-
bution; prior approaches derived the parameters of
the reordering models from word aligned bitext, e.g.
within the phrase pair extraction procedure.
We have shown that these models yield improve-
ments in alignment and translation performance on
Arabic-English and Chinese-English tasks, and that
the reordering model can be integrated into large
evaluation systems. Our experiments show that dis-
criminative training procedures such Minimum Er-
ror Training also yield additive improvements by
tuning TTM systems which incorporate ML-trained
reordering models. This is essential for integrating
our reordering model inside an evaluation system,
where a variety of techniques are applied simultane-
ously.
The MJ-1 and MJ-2 models are extremely sim-
ple models of phrase reordering. Despite their sim-
plicity, these models provide large improvements
in BLEU score when incorporated into a monotone
phrase order translation system. Moreover, they
can be used to produced translation lattices for use
by more sophisticated reordering models that allow
longer phrase order movement. Future work will
build on these simple structures to produce more
powerful models of word and phrase movement in
translation.
References
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and
R. L. Mercer. 1993. The mathematics of statistical
machine translation: Parameter estimation. Computa-
tional Linguistics, 19(2):263–311.
Y. Deng, S. Kumar, and W. Byrne. 2004. Bitext chunk
alignment for statistical machine translation. In Re-
search Note, Center for Language and Speech Pro-
cessing, Johns Hopkins University.
K. Knight and Y. Al-Onaizan. 1998. Translation
with finite-state devices. In AMTA, pages 421–437,
Langhorne, PA, USA.
K. Knight. 1999. Decoding complexity in word-
replacement translation models. Computational Lin-
guistics, Squibs & Discussion, 25(4).
P. Koehn, F. Och, and D. Marcu. 2003. Statistical phrase-
based translation. In HLT-NAACL, pages 127–133,
Edmonton, Canada.
S. Kumar and W. Byrne. 2003. A weighted finite state
transducer implementation of the alignment template
model for statistical machine translation. In HLT-
NAACL, pages 142–149, Edmonton, Canada.
S. Kumar, Y. Deng, and W. Byrne. 2005. A weighted fi-
nite state transducer translation template model for sta-
tistical machine translation. Journal of Natural Lan-
guage Engineering, 11(4).
F. Och, C. Tillmann, and H. Ney. 1999. Improved align-
ment models for statistical machine translation. In
EMNLP-VLC, pages 20–28, College Park, MD, USA.
F. Och. 2003. Minimum error rate training in statistical
machine translation. In ACL, Sapporo, Japan.
C. Tillmann and T. Zhang. 2005. A localized prediction
model for statistical machine translation. In ACL, Ann
Arbor, Michigan, USA.
C. Tillmann. 2004. A block orientation model for sta-
tistical machine translation. In HLT-NAACL, Boston,
MA, USA.
H. Tsukada and M. Nagata. 2004. Efficient decoding for
statistical machine translation with a fully expanded
WFST model. In EMNLP, Barcelona, Spain.
S. Vogel. 2003. SMT Decoder Dissected: Word Reorder-
ing. In NLPKE, Beijing, China.
D. Wu. 1996. A polynomial-time algorithm for sta-
tistical machine translation. In ACL, pages 152–158,
Santa Cruz, CA, USA.
R. Zens and H. Ney. 2003. A comparative study on re-
ordering constraints in statistical machine translation.
In ACL, pages 144–151, Sapporo, Japan.
R. Zens, H. Ney, T. Watanabe, and E. Sumita. 2004.
Reordering constraints for phrase-based statistical ma-
chine translation. In COLING, pages 205–211,
Boston, MA, USA.
168
