A Unigram Orientation Model for Statistical Machine Translation
Christoph Tillmann
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
ctill@us.ibm.com
Abstract
In this paper, we present a unigram segmen-
tation model for statistical machine transla-
tion where the segmentation units are blocks:
pairs of phrases without internal structure. The
segmentation model uses a novel orientation
component to handle swapping of neighbor
blocks. During training, we collect block un-
igram counts with orientation: we count how
often a block occurs to the left or to the right of
some predecessor block. The orientation model
is shown to improve translation performance
over two models: 1) no block re-ordering is
used, and 2) the block swapping is controlled
only by a language model. We show exper-
imental results on a standard Arabic-English
translation task.
1 Introduction
In recent years, phrase-based systems for statistical ma-
chine translation (Och et al., 1999; Koehn et al., 2003;
Venugopal et al., 2003) have delivered state-of-the-art
performance on standard translation tasks. In this pa-
per, we present a phrase-based unigram system similar
to the one in (Tillmann and Xia, 2003), which is ex-
tended by a unigram orientation model. The units of
translation are blocks, pairs of phrases without internal
structure. Fig. 1 shows an example block translation us-
ing five Arabic-English blocks $b_1, \ldots, b_5$. The unigram
orientation model is trained from word-aligned training
data. During decoding, we view translation as a block
segmentation process, where the input sentence is seg-
mented from left to right and the target sentence is gener-
ated from bottom to top, one block at a time. A monotone
block sequence is generated, except for the possibility of
swapping a pair of neighbor blocks. The novel orientation
model is used to assist the block swapping: as shown in
[Figure: five blocks $b_1, \ldots, b_5$ covering the romanized Arabic source and the English translation "Israeli warplanes violate Lebanese airspace"; blocks $b_2$ and $b_5$ are swapped relative to the source order.]
Figure 1: An Arabic-English block translation example
taken from the devtest set. The Arabic words are roman-
ized.
section 3, block swapping where only a trigram language
model is used to compute probabilities between neighbor
blocks fails to improve translation performance. (Wu,
1996; Zens and Ney, 2003) present re-ordering models
that make use of a straight/inverted orientation model
related to our work. Here, we investigate in detail
the effect of restricting the word re-ordering to neighbor
block swapping only.
In this paper, we assume a block generation process that
generates block sequences from bottom to top, one block
at a time. The score of a successor block $b$ depends on its
predecessor block $b'$ and on its orientation relative to the
block $b'$. In Fig. 1, for example, block $b_1$ is the predecessor
of block $b_2$, and block $b_2$ is the predecessor of block $b_3$.
The target clump of a predecessor block $b'$ is adjacent to
the target clump of a successor block $b$. A right adjacent
predecessor block $b'$ is a block where additionally the
source clumps are adjacent and the source clump of $b'$
occurs to the right of the source clump of $b$. A left adjacent
predecessor block is defined accordingly.
During decoding, we compute the score $Pr(b_1^n, o_1^n)$ of a
block sequence $b_1^n$ with orientation $o_1^n$ as a product of
block bigram scores:

$$Pr(b_1^n, o_1^n) = \prod_{i=1}^{n} p(b_i, o_i \mid b_{i-1}, o_{i-1}), \qquad (1)$$
where $b_i$ is a block and $o_i \in \{L, R, N\}$ is a three-valued
orientation component linked to the block $b_i$ (the orientation
$o_{i-1}$ of the predecessor block is ignored). A block
$b_i$ has right orientation ($o_i = R$) if it has a left adjacent
predecessor. Accordingly, a block $b_i$ has left orientation
($o_i = L$) if it has a right adjacent predecessor. If a block
has neither a left nor a right adjacent predecessor, its orientation
is neutral ($o_i = N$). The neutral orientation is not
modeled explicitly in this paper; rather, it is handled as a
default case as explained below. In Fig. 1, the orientation
sequence is $o_1^5 = (N, L, N, N, L)$, i.e. block $b_2$ and
block $b_5$ are generated using left orientation. During decoding,
most blocks have right orientation ($o = R$), since
the block translations are mostly monotone.
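Viewed in log-space, Eq. 1 turns the search objective into a sum of block bigram scores. The following minimal Python sketch makes this explicit; it is our own illustration, not the paper's implementation, and `bigram_score` is a placeholder for the model defined in the next paragraphs:

```python
def sequence_score(blocks, orientations, bigram_score):
    """Log of Eq. 1: sum over i of log p(b_i, o_i | b_{i-1}, o_{i-1}).

    bigram_score(b, o, b_prev, o_prev) must return the log of one factor;
    for the first block there is no predecessor (None is passed).
    """
    total, prev_b, prev_o = 0.0, None, None
    for b, o in zip(blocks, orientations):
        total += bigram_score(b, o, prev_b, prev_o)
        prev_b, prev_o = b, o
    return total
```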
We try to find the block sequence with orientation $(b_1^n, o_1^n)$
that maximizes $Pr(b_1^n, o_1^n)$. The following three types
of parameters are used to model the block bigram score
$p(b_i, o_i \mid b_{i-1}, o_{i-1})$ in Eq. 1:
• Two unigram count-based models: $p(b)$ and $p_o(o)$.
We compute the unigram probability $p(b)$ of
a block based on its occurrence count $N(b)$. The
blocks are counted from word-aligned training data.
We also collect unigram counts with orientation: a
left count $N_L(b)$ and a right count $N_R(b)$. These
counts are defined via an enumeration process and
are used to define the orientation model $p_o(o)$:

$$p_o(o \in \{L, R\}) = \frac{N_o(b)}{N_L(b) + N_R(b)}.$$
• Trigram language model: The block language
model score $p(b_i \mid b_{i-1})$ is computed as the probability
of the first target word in the target clump of
$b_i$ given the final two words of the target clump of
$b_{i-1}$.
The three models are combined in a log-linear way, as
shown in the following section.
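To make the count-based components concrete, here is a minimal sketch of how $p(b)$ and $p_o(o)$ could be computed from collected counts. It is our own illustration under the paper's definitions, not the original implementation; the table names and the unsmoothed relative-frequency estimates are assumptions:

```python
from collections import Counter

# Count tables filled from word-aligned training data (see Section 2):
# N[b] is the unigram count of block b; N_L[b] / N_R[b] count how often b
# occurs with a right / left adjacent predecessor, respectively.
N, N_L, N_R = Counter(), Counter(), Counter()

def unigram_prob(b):
    """p(b) as a relative frequency over all block occurrences."""
    total = sum(N.values())
    return N[b] / total if total else 0.0

def orientation_prob(b, o):
    """p_o(o in {L, R}) = N_o(b) / (N_L(b) + N_R(b))."""
    denom = N_L[b] + N_R[b]
    if denom == 0:
        return 0.0  # block never seen with an adjacent predecessor
    return (N_L[b] if o == "L" else N_R[b]) / denom
```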
2 Orientation Unigram Model
The basic idea of the orientation model can be illustrated
as follows: in the example translation in Fig. 1, block $b_2$
occurs to the left of block $b_1$. Although the joint block
$(b_2, b_1)$ consisting of the two smaller blocks $b_1$ and $b_2$
has not been seen in the training data, we can still profit
from the fact that block $b_2$ occurs more frequently with
left than with right orientation. In our Arabic-English
training data, block $b_2$ has been seen $N_L(b_2) = 52$ times
with left orientation and $N_R(b_2) = 0$ times with right
orientation, i.e. it is always involved in swapping. This intuition
is formalized using unigram counts with orientation.
The orientation model is related to the distortion model
in (Brown et al., 1993), but we do not compute a block
alignment during training. We rather enumerate all rele-
vant blocks in some order. Enumeration does not allow
us to capture position dependent distortion probabilities,
but we can compute statistics about adjacent block prede-
cessors.
Our baseline model is the unigram monotone model described
in (Tillmann and Xia, 2003). Here, we select
blocks $b$ from word-aligned training data, and unigram
block occurrence counts $N(b)$ are computed: all blocks
for a training sentence pair are enumerated in some order,
and we count how often a given block occurs in the parallel
training data.¹ The training algorithm yields a list
of about 65 blocks per training sentence pair. In this paper,
we make extended use of the baseline enumeration
procedure: for each block $b$, we additionally enumerate
all its left and right predecessors $b'$. No optimal block
segmentation is needed to compute the predecessors: for
each block $b$, we check for adjacent predecessor blocks $b'$
that also occur in the enumeration list. We compute the left
orientation count $N_L(b)$ as follows:

$$N_L(b) = \# \{\, b' \mid b' \text{ is a right adjacent predecessor of } b \,\}.$$
Here, we enumerate all adjacent predecessors $b'$ of block
$b$ over all training sentence pairs; the identity of $b'$ is ignored.
$N_L(b)$ is the number of times the block $b$ succeeds
some right adjacent predecessor block $b'$. The 'right' orientation
count $N_R(b)$ is defined accordingly. Note that
in general the unigram count $N(b) \neq N_L(b) + N_R(b)$:
during enumeration, a block $b$ might have both left and
right adjacent predecessors, either a left or a right adjacent
predecessor, or no adjacent predecessors at all. The
orientation count collection is illustrated in Fig. 2: each
time a block $b$ has a left or right adjacent predecessor in
the parallel training data, the orientation counts are incremented
accordingly.
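The count collection itself can be sketched as follows. This is our own rendering of the enumeration step; the `Block` record with source/target spans and the adjacency tests are assumptions consistent with the definitions above:

```python
from collections import Counter, namedtuple

# A block pairs a source span [s_beg, s_end] with a target span [t_beg, t_end];
# 'key' is the (source phrase, target phrase) pair indexing the count tables.
Block = namedtuple("Block", "s_beg s_end t_beg t_end key")

N_L, N_R = Counter(), Counter()  # orientation counts, as in the earlier sketch

def collect_orientation_counts(blocks):
    """Update N_L / N_R from all blocks enumerated for one sentence pair."""
    for b in blocks:
        for p in blocks:  # candidate adjacent predecessors; identity ignored
            if p.t_end + 1 != b.t_beg:
                continue                  # target clumps must be adjacent
            if p.s_beg == b.s_end + 1:    # p right adjacent on the source side
                N_L[b.key] += 1
            elif p.s_end + 1 == b.s_beg:  # p left adjacent on the source side
                N_R[b.key] += 1
```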
The decoding orientation restrictions are illustrated in
Fig. 3: a monotone block sequence with right ($o = R$)
orientation is generated. If a block is skipped, e.g. block
$b_3$ in Fig. 3, by first generating block $b_2$ and then block
$b_3$, the block $b_3$ is generated using left orientation
($o_3 = L$). Since the block translation is generated from
bottom to top, the blocks $b_2$ and $b_4$ do not have adjacent
predecessors below them: they are generated by a default
model $p(b_i \mid b_{i-1})$ without an orientation component.
The orientation model is given in Eq. 2, the default model
is given in Eq. 3.

¹We keep all blocks whose count $N(b)$ exceeds a small threshold and whose phrase length does not exceed a fixed maximum; no other selection criteria are applied. For the R1+OR model, a separate count threshold is used.

[Figure: two configurations of a block $b$ with an adjacent predecessor $b'$: a right adjacent predecessor increments $N_L(b)$; a left adjacent predecessor increments $N_R(b)$.]
Figure 2: During training, blocks are enumerated in some
order: for each block $b$, we look for left and right adjacent
predecessors $b'$.
The block bigram model $p(b_i, o_i \in \{L, R\} \mid b_{i-1}, o_{i-1})$ in
Eq. 1 is defined as:

$$p(b_i, o_i \in \{L, R\} \mid b_{i-1}, o_{i-1}) = p(b_i)^{\lambda_1} \cdot p(b_i \mid b_{i-1})^{\lambda_2} \cdot p_o(o_i)^{\lambda_3}, \qquad (2)$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1.0$ and the orientation $o_{i-1}$ of the
predecessor is ignored. The $\lambda_i$ are chosen to be optimal
on the devtest set (the optimal parameter setting is shown
in Table 1). Only two parameters have to be optimized
due to the constraint that the $\lambda_i$ have to sum to $1.0$.
The default model $p(b_i, o_i = N \mid b_{i-1}, o_{i-1}) = p(b_i \mid b_{i-1})$ is
defined as:

$$p(b_i \mid b_{i-1}) = p(b_i)^{\lambda'_1} \cdot p(b_i \mid b_{i-1})^{\lambda'_2}, \qquad (3)$$

where $\lambda'_1 + \lambda'_2 = 1.0$. The $\lambda'_i$ are not optimized separately;
rather, we define $\lambda'_1 = \lambda_1 / (\lambda_1 + \lambda_2)$.
Straightforward normalization over all successor blocks
in Eq. 2 and in Eq. 3 is not feasible: there are tens of millions
of possible successor blocks $b$. In future work, normalization
over a restricted successor set, e.g., for a given
source input sentence, all blocks $b$ that match this sentence,
might be useful for both training and decoding. The
segmentation model in Eq. 1 naturally prefers translations
that make use of a smaller number of blocks, which leads
to a smaller number of factors in Eq. 1. Using fewer,
'bigger' blocks to carry out the translation generally seems
to improve translation performance. Since normalization
does not influence the number of blocks used to carry out
the translation, it might be less important for our segmentation
model.
[Figure: left picture, a monotone block sequence $b_1, b_2, b_3, b_4$ with orientations $o_1 = R$, $o_2 = R$, $o_3 = R$, $o_4 = R$; right picture, block $b_3$ swapped to the left of block $b_2$, with orientations $o_1 = R$, $o_2 = N$, $o_3 = L$, $o_4 = N$.]
Figure 3: During decoding, a mostly monotone block sequence
with ($o_i = R$) orientation is generated, as shown
in the left picture. In the right picture, block swapping
generates block $b_3$ to the left of block $b_2$. The blocks $b_2$
and $b_4$ do not have a left or right adjacent predecessor.

We use a DP-based beam search procedure similar to the
one presented in (Tillmann and Xia, 2003). We maximize
over all block segmentations with orientation $(b_1^n, o_1^n)$ for
which the source phrases yield a segmentation of the input
sentence. Swapping involves only block pairs $(b, b')$ for
which $N_L(b) \geq 3$ holds for the successor block $b$, e.g. the
blocks $b_2$ and $b_5$ in Fig. 1. We tried several thresholds for
$N_L(b)$, and performance is reduced significantly only if
$N_L(b) \geq 30$ is required. No other parameters are used to
control the block swapping. In particular, the orientation $o'$
of the predecessor block $b'$ is ignored: in future work, we
might take into account that a certain predecessor block $b'$
typically precedes other blocks.
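In the decoder, the orientation of a candidate successor block and the swap restriction can be checked from the block spans and the $N_L$ counts alone. A possible realization under the adjacency definitions of Section 1 (helper names and the sentence-initial handling are our own simplifications):

```python
from collections import Counter

N_L = Counter()  # left orientation counts, filled during training

def orientation(b, b_prev):
    """Three-valued orientation of successor b relative to predecessor b_prev."""
    if b_prev is None:                # sentence-initial handling simplified
        return "N"
    if b_prev.s_beg == b.s_end + 1:   # predecessor right adjacent: swap case
        return "L"
    if b_prev.s_end + 1 == b.s_beg:   # predecessor left adjacent: monotone case
        return "R"
    return "N"                        # not source-adjacent: default model

def swap_allowed(b, threshold=3):
    """Only successors frequently seen with left orientation may be swapped."""
    return N_L[b.key] >= threshold
```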
3 Experimental Results
The translation system is tested on an Arabic-to-English
translation task. The training data comes from UN news
sources: 87.5 million Arabic and 97.1 million English
words. The training data is sentence-aligned, yielding 3.3
million training sentence pairs. The Arabic data is romanized,
and some punctuation tokenization and some number
classing are carried out on the English and the Arabic
training data. As devtest set, we use test data provided
by LDC, which consists of 1043 sentences with 25,889
Arabic words and 4 reference translations. As blind test
set, we use the MT 03 Arabic-English DARPA evaluation
test set, consisting of 663 sentences with 16,278 Arabic
words.
Three systems are evaluated in our experiments: R0 is the
baseline block unigram model without re-ordering. Here,
monotone block alignments are generated: the blocks $b_i$
have only left predecessors (no blocks are swapped). This
is the model presented in (Tillmann and Xia, 2003). For
the R1 model, the sentence is translated mostly monotonously,
and only neighbor blocks are allowed to be
swapped (at most 1 block is skipped). The R1+OR model
allows for the same block swapping as the R1 model, but
additionally uses the orientation component described in
Section 2: the block swapping is controlled by the unigram
orientation counts.
Table 1: Effect of the orientation model on Arabic-
English test data: LDC devtest set and DARPA MT 03
blind test set.

Test       Unigram Model   Setting (λ1, λ2, λ3)   BLEUr4n4
Dev test   R1              (0.74, 0.26, -)        0.344 ± 0.012
           R0              (0.77, 0.23, -)        0.355 ± 0.013
           R1+OR           (0.66, 0.27, 0.07)     0.368 ± 0.014
Test       R1              (0.74, 0.26, -)        0.336 ± 0.017
           R0              (0.77, 0.23, -)        0.339 ± 0.012
           R1+OR           (0.66, 0.27, 0.07)     0.356 ± 0.017
Table 2: Arabic-English example blocks from the devtest
set: the Arabic phrases are romanized. The example
blocks were swapped in the development test set translations.
The counts are obtained from the parallel training
data.

Arabic-English blocks                       N_L(b)   N_R(b)
('exhibition', 'mErD')                        97       32
('added', 'wADAf')                           285       68
('said', 'wqAl')                             872      801
('suggested', 'AqtrH')                       356      729
('terrorist attacks', 'hjmAt ArhAbyp')        14       27
The R0 and R1 models use the block bigram model in
Eq. 3: all blocks $b$ are generated with neutral orientation
($o = N$), and only two components, the block unigram
model $p(b_i)$ and the block bigram score $p(b_i \mid b_{i-1})$,
are used.
Experimental results are reported in Table 1: three BLEU
results are presented for both the devtest set and the blind
test set. Two scaling parameters are set on the devtest set
and copied for use on the blind test set. The second column
shows the model name, and the third column presents the
optimal weighting as obtained from the devtest set by carrying
out an exhaustive grid search. The fourth column
shows BLEU results together with confidence intervals
(word casing is ignored here). The block swapping model
R1+OR obtains a statistically significant improvement
over the baseline R0 model. Interestingly, the swapping
model R1 without orientation performs worse than the
baseline R0 model: the word-based trigram language
model alone is too unrestrictive to control the block swapping
reliably. Additionally, Table 2 presents devtest set
example blocks that have actually been swapped. The
training data is unsegmented, as can be seen from the
first two blocks. The block in the first line has been seen
3 times more often with left than with right orientation.
Blocks for which the ratio $r = N_L(b) / N_R(b)$ is bigger
than $0.25$ are likely candidates for swapping in our
Arabic-English experiments; for the block ('suggested',
'AqtrH') in Table 2, for example, $r = 356/729 \approx 0.49$.
The ratio $r$ itself is not currently used in the orientation
model. The orientation model mostly affects blocks where
the Arabic and English words are verbs or nouns. As
shown in Fig. 1, the orientation model uses the orientation
probability $p_o(o = L)$ for the noun block $b_2$, and only the
default model for the adjective block $b_1$. Although the
noun block might occur by itself without the adjective, the
swapping is not controlled by the occurrence of the adjective
block $b_1$ (which does not have adjacent predecessors).
We rather model the fact that a noun block $b$ is typically
preceded by some block $b'$. This situation seems typical
of the block swapping that occurs on the evaluation test set.
Acknowledgment
This work was partially supported by DARPA and mon-
itored by SPAWAR under contract No. N66001-99-2-
8916. The paper has greatly profited from discussion with
Kishore Papineni and Fei Xia.
References
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della
Pietra, and Robert L. Mercer. 1993. The Mathematics
of Statistical Machine Translation: Parameter Estima-
tion. Computational Linguistics, 19(2):263–311.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical Phrase-Based Translation. In Proc.
of the HLT-NAACL 2003 conference, pages 127–133,
Edmonton, Canada, May.
Franz Josef Och, Christoph Tillmann, and Hermann Ney.
1999. Improved Alignment Models for Statistical Ma-
chine Translation. In Proc. of the Joint Conf. on Em-
pirical Methods in Natural Language Processing and
Very Large Corpora (EMNLP/VLC 99), pages 20–28,
College Park, MD, June.
Christoph Tillmann and Fei Xia. 2003. A Phrase-based
Unigram Model for Statistical Machine Translation. In
Companion Vol. of the Joint HLT and NAACL Confer-
ence (HLT 03), pages 106–108, Edmonton, Canada,
June.
Ashish Venugopal, Stephan Vogel, and Alex Waibel.
2003. Effective Phrase Translation Extraction from
Alignment Models. In Proc. of the 41st Annual Conf.
of the Association for Computational Linguistics (ACL
03), pages 319–326, Sapporo, Japan, July.
Dekai Wu. 1996. A Polynomial-Time Algorithm for Sta-
tistical Machine Translation. In Proc. of the 34th An-
nual Conf. of the Association for Computational Lin-
guistics (ACL 96), pages 152–158, Santa Cruz, CA,
June.
Richard Zens and Hermann Ney. 2003. A Comparative
Study on Reordering Constraints in Statistical Machine
Translation. In Proc. of the 41st Annual Conf. of the
Association for Computational Linguistics (ACL 03),
pages 144–151, Sapporo, Japan, July.
