A Projection Extension Algorithm for Statistical Machine Translation
Christoph Tillmann
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
ctill@us.ibm.com
Abstract
In this paper, we describe a phrase-based
unigram model for statistical machine
translation that uses a much simpler set
of model parameters than similar phrase-
based models. The units of translation are
blocks – pairs of phrases. During decod-
ing, we use a block unigram model and a
word-based trigram language model. Dur-
ing training, the blocks are learned from
source interval projections using an un-
derlying high-precision word alignment.
The system performance is significantly
increased by applying a novel block exten-
sion algorithm using an additional high-
recall word alignment. The blocks are fur-
ther filtered using unigram-count selection
criteria. The system has been successfully tested on a Chinese-English and an Arabic-English translation task.
1 Introduction
Various papers use phrase-based translation systems
(Och et al., 1999; Marcu and Wong, 2002; Ya-
mada and Knight, 2002) that have been shown to improve
translation quality over single-word based transla-
tion systems introduced in (Brown et al., 1993). In
this paper, we present a similar system with a much
simpler set of model parameters. Specifically, we
compute the probability of a block sequence $b_1^n$. A block $b$ is a pair consisting of a contiguous source and a contiguous target phrase. The block sequence
Figure 1: A block sequence that jointly generates the target and source phrases. The example is actual decoder output and the English translation is slightly incorrect.
probability $Pr(b_1^n)$ is decomposed into conditional probabilities using the chain rule:

$$Pr(b_1^n) = \prod_{i=1}^{n} Pr(b_i \mid b_{i-1}) \quad (1)$$

$$\approx \prod_{i=1}^{n} p^{\alpha}(b_i) \cdot p^{1-\alpha}(b_i \mid b_{i-1})$$
We try to find the block sequence that maximizes $Pr(b_1^n)$: $\hat{b}_1^n = \arg\max_{b_1^n} Pr(b_1^n)$. The model proposed is a joint model as in (Marcu and Wong, 2002), since target and source phrases are generated jointly. The approach is illustrated in Figure 1. The source phrases are given on the $x$-axis and the target phrases are given on the $y$-axis. During block decoding a bijection between source and target phrases is generated. The two types of parameters in Eq. 1 are defined as:
- Block unigram model $p(b_i)$: We compute unigram probabilities for the blocks. The blocks are simpler than the alignment templates (Och et al., 1999) in that they do not have an internal structure.

- Trigram language model: the probability $p(b_i \mid b_{i-1})$ between adjacent blocks is computed as the probability of the first target word in the target clump of $b_i$ given the final two words of the target clump of $b_{i-1}$.
The exponent $\alpha$ is set in informal experiments to be $0.5$. No other parameters such as distortion probabilities are used.
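To make the decomposition concrete, the following is a minimal sketch of the Eq. 1 score in Python. It is our own illustration, not the paper's implementation: `unigram_prob` and `trigram_prob` are assumed stand-ins for the two trained models, and phrases are tuples of words.

```python
import math

def block_sequence_score(blocks, unigram_prob, trigram_prob, alpha=0.5):
    """Log-score of a block sequence per Eq. 1: the block unigram model
    weighted by alpha, plus the inter-block trigram factor (first target
    word of b_i given the final two target words of b_{i-1}) weighted by
    1 - alpha. Each block is a (target_phrase, source_phrase) pair."""
    score = 0.0
    history = ("<s>", "<s>")  # trigram history carried across block boundaries
    for target_phrase, source_phrase in blocks:
        b = (target_phrase, source_phrase)
        score += alpha * math.log(unigram_prob[b])
        score += (1.0 - alpha) * math.log(trigram_prob(target_phrase[0], history))
        history = (tuple(history) + tuple(target_phrase))[-2:]
    return score
```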
To select blocks $b$ from training data, we compute unigram block co-occurrence counts $N(b)$. $N(b)$
cannot be computed for all blocks in the training
data: we would obtain hundreds of millions of
blocks. The blocks are restricted by an underlying
word alignment. In this paper, we present a block
generation algorithm similar to the one in (Och et
al., 1999) in full detail: source intervals are pro-
jected into target intervals under a restriction derived
from a high-precision word alignment. The projec-
tion yields a set of high-precision block links. These
block links are further extended using a high-recall
word alignment. The block extension algorithm is
shown to improve translation performance signifi-
cantly. The system is tested on a Chinese-English
(CE) and an Arabic-English (AE) translation task.
The paper is structured as follows: in Section 2,
we present the baseline block generation algorithm.
The block extension approach is described in Sec-
tion 2.1. Section 3 describes a DP-based decoder
using blocks. Experimental results are presented in
Section 4.
2 Block Generation Algorithm
The starting point for the block generation algorithm is
a word alignment obtained from an HMM Viterbi
training (Vogel et al., 1996). The HMM Viterbi
training is carried out twice with English as target
language and Chinese as source language and vice
versa. We obtain two alignment relations:
$$A_1 = \{\, (j, i) \mid i = a_1(j) \,\}$$
$$A_2 = \{\, (j, i) \mid j = a_2(i) \,\}$$

Here, $a_1 : j \mapsto i$ is an alignment function from source to target positions and $a_2 : i \mapsto j$ is an alignment function from target to source positions 1. We compute the union and the intersection of the two alignment relations $A_1$ and $A_2$:

$$I = A_1 \cap A_2, \qquad U = A_1 \cup A_2$$
We call the intersection relation $I$, because it represents a high-precision alignment, and the union alignment $U$, because it is taken to be a lower-precision, higher-recall alignment (Och and Ney, 2000).
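A small sketch of how the two relations and their intersection and union might be computed is given below (hypothetical code; `a1` and `a2` stand for the Viterbi alignment functions of the two training directions, represented as dictionaries):

```python
def alignment_relations(a1, a2):
    """a1 maps each source position j to its aligned target position i;
    a2 maps each target position i to a source position j."""
    A1 = {(j, i) for j, i in a1.items()}
    A2 = {(j, i) for i, j in a2.items()}
    I = A1 & A2  # intersection: high precision, a partial bijection
    U = A1 | A2  # union: lower precision, higher recall; I is a subset of U
    return I, U
```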
The intersection $I$ is also a (partial) bijection between the target and source positions: it covers the same number of target and source positions, and there is a bijection between the source and target positions that are covered. For the CE experiments reported in Section 4, about 70 % of the target and source positions are covered by word links in $I$; for the AE experiments, about 80 % are covered. The extension algorithm presented assumes that $I \subseteq U$, which is valid in this case since $I$ and $U$ are derived from intersection and union. We introduce the following additional piece of notation:
$$\mathrm{cov}(I) = \{\, j \mid \exists\, i : (j, i) \in I \,\} \quad (2)$$

$\mathrm{cov}(I)$ is the set of all source positions that are covered by some word link in $I$; the source positions are shown along the $j$-axis and the target positions along the $i$-axis. To derive high-precision block links from the high-precision word links, we use the following projection definition:

$$\mathrm{pr}_T([j', j]) = \{\, i \mid \exists\, \hat{\jmath} \in [j', j] : (\hat{\jmath}, i) \in I \,\}$$

Here, $\mathrm{pr}_T(\cdot)$ projects source intervals into target intervals; $\mathrm{pr}_S(\cdot)$ projects target intervals into source intervals and is defined accordingly. Starting from the high-precision word alignment $I$, we try to derive a high-precision block alignment: we project source intervals $[j', j]$, where $j', j \in \mathrm{cov}(I)$. We compute the minimum target index $i'$ and the maximum target index $i$ for the word links $l \in I$ that fall into the

1 $j$ and $\hat{\jmath}$ denote source positions. $i$ and $\hat{\imath}$ denote target positions.
Figure 2: The left picture shows three blocks that are learned by projecting three source intervals. The right picture shows three blocks that cannot be obtained from source interval projections.
Table 1: Block learning algorithm using the intersection $I$.

input: High-precision alignment $I$
$B \leftarrow \emptyset$
for each interval $[j', j]$, where $j', j \in \mathrm{cov}(I)$ do
  $i' \leftarrow \min \{\, i \mid (\hat{\jmath}, i) \in I,\ \hat{\jmath} \in [j', j] \,\}$
  $i \leftarrow \max \{\, i \mid (\hat{\jmath}, i) \in I,\ \hat{\jmath} \in [j', j] \,\}$
  Extend the block link $l = ([j', j], [i', i])$ 'outwards' using the algorithm in Table 2 and add the extended block link set to $B$
output: Sentence block link set $B$.
interval $[j', j]$. This way, we obtain a mapping of source intervals into target intervals:

$$[j', j] \;\mapsto\; \Big[\ \min_{(\hat{\jmath},\, i) \in I,\ \hat{\jmath} \in [j', j]} i\ ,\ \max_{(\hat{\jmath},\, i) \in I,\ \hat{\jmath} \in [j', j]} i\ \Big] \quad (3)$$
The approach is illustrated in Figure 2, where in the left picture, for example, a source interval is projected into its corresponding target interval. The pair $([j', j], [i', i])$ defines a block alignment link $l$. We use this notation to emphasize that the identity of the words is not used in the block learning algorithm. To denote the block consisting of the target and source words at the link positions, we write $b = W(l)$:

$$b = W(l) = \big(\, W([i', i]),\ W([j', j]) \,\big) = \big(\, (e_{i'}, \ldots, e_i),\ (f_{j'}, \ldots, f_j) \,\big),$$

where the $e_i$ denote target words and the $f_j$ denote source words. $W(\cdot)$ denotes a function that maps intervals
Figure 3: One, two, three, or four word links in $I$ lie on the frontier of a block. Additional word links may be inside the block.
to the words at these intervals. The algorithm for
generating the high-precision block alignment links
is given in Table 1. The order in which the source
intervals are generated does not change the final link
set.
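The projection step of Table 1 can be written compactly. The sketch below is our own illustration; it returns the high-precision block links before any extension:

```python
def project_blocks(I):
    """For every source interval [j1, j2] with both boundaries in cov(I),
    compute the tightest target interval covering all links of I that
    fall into [j1, j2] (Eq. 3). Extension by Table 2 is omitted here."""
    cov = sorted({j for (j, i) in I})  # source positions covered by I
    links = set()
    for a in range(len(cov)):
        for b in range(a, len(cov)):
            j1, j2 = cov[a], cov[b]
            targets = [i for (j, i) in I if j1 <= j <= j2]
            links.add(((j1, j2), (min(targets), max(targets))))
    return links
```

Since $j_1 \in \mathrm{cov}(I)$, the target list is never empty, and, as stated above, the resulting link set does not depend on the order in which intervals are enumerated.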
2.1 Block Extension Algorithm
Empirically, we find that expanding the high-
precision block links significantly improves perfor-
mance. The expansion is parameterised and de-
scribed below. For a block link $l = ([j', j], [i', i])$, we compute its frontier $F(l)$ by looking at all word links that lie on one of the four boundary lines of a block. We make the following observation, as shown in Figure 3: the number of links (filled dots in the picture) on the frontier $F(l)$ is less than or equal to $4$, since in every column and row there is at most one link in $I$, which is a partial bijection. To learn blocks from a general word alignment that is not a bijection, more than $4$ word links may lie on the frontier of a block, but to compute all possible blocks it is sufficient to look at all possible quadruples of word links. We extend the links on the frontier by links of the high-recall alignment $U$, where we use a parameterised way of locally extending a given word link. We compute an extended link set $E$ by extending each word link on the frontier separately and taking the union of the resulting links. The way a word link is extended is illustrated in Figure 4. The filled dot in the center of the picture is an element of the high-precision set $I$. Starting from this link, we look for
Figure 4: Point extension scheme. Solid word links lie in $I$, striped word links lie in $U$.
Figure 5: ’Outward’ extension of a high-precision
block link.
extensions in its neighborhood that lie in $U$, where the neighborhood is defined by a cell width parameter $w$ and a distance parameter $\delta$. For instance, link $l_1$ in Figure 4 is reached with cell width $w = 1$ and distance $\delta = 1$, the link $l_2$ is reached with $w = 1$ and $\delta = 2$, and the link $l_3$ is reached with $w = 2$ and $\delta = 3$. The word link $l$ is added to $E$ and is itself extended using the same scheme. Here, we never make use of a row or a column covered by $I$ other than the rows $i$ and $i'$ and the columns $j$ and $j'$. Also, we do not cross such a row or column using an extension with $\delta \geq 2$: this way only a small fraction of the word links in $U$ is used for extending a single block link. The extensions are carried out iteratively until no new alignment links from $U$ are added to $E$. The block extension algorithm in Table 2 uses the extension set $E$ to generate all word link quadruples:
Table 2: Block link extension algorithm. The min and max functions compute the minimum and the maximum of four integer values.

input: Block link $l = ([j', j], [i', i])$
$L \leftarrow \emptyset$
Compute the extension set $E$ from the frontier $F(l)$
for each quadruple $(j_1, i_1), (j_2, i_2), (j_3, i_3), (j_4, i_4) \in E$ do
  $\hat{\jmath}' \leftarrow \min_{1 \leq k \leq 4} j_k$ ,  $\hat{\jmath} \leftarrow \max_{1 \leq k \leq 4} j_k$
  $\hat{\imath}' \leftarrow \min_{1 \leq k \leq 4} i_k$ ,  $\hat{\imath} \leftarrow \max_{1 \leq k \leq 4} i_k$
  if $l \subseteq b' = ([\hat{\jmath}', \hat{\jmath}], [\hat{\imath}', \hat{\imath}])$ then $L \leftarrow L \cup \{b'\}$
output: Extended block link set $L$.
the extended block $b'$ that is defined by a given quadruple is generated, and a check is carried out whether $b'$ includes the seed block link $l$. The following definition of block link inclusion is used:

$$l' \subseteq l \ :\Longleftrightarrow\ [\hat{\jmath}', \hat{\jmath}] \subseteq [j', j] \ \text{and}\ [\hat{\imath}', \hat{\imath}] \subseteq [i', i],$$

where the block link $l' = ([\hat{\jmath}', \hat{\jmath}], [\hat{\imath}', \hat{\imath}])$ is said to be included in $l = ([j', j], [i', i])$. $[\hat{\imath}', \hat{\imath}] \subseteq [i', i]$ holds iff $\hat{\imath}' \geq i'$ and $\hat{\imath} \leq i$. The 'seed' block link $l$ is extended 'outwardly': all extended blocks $b'$ include the high-precision block link $l$. The block link $l$ may itself be included in other high-precision block links $l'$, but then $l \subseteq b' \subseteq l'$ holds. An extended block $b'$ derived from the block link $l$ never violates the projection restriction relative to $I$, i.e., we do not have to re-check the projection restriction for any generated block, which simplifies and speeds up the generation algorithm. The approach is illustrated in Figure 5, where a high-precision block with three elements on its frontier is extended by two blocks containing it.
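A sketch of the quadruple generation and inclusion check of Table 2 follows (our own illustration; computing the extension set $E$ from the frontier and the $U$ links is assumed to be done elsewhere):

```python
from itertools import combinations_with_replacement

def extend_block_link(link, E):
    """Bound every quadruple of word links drawn from E by its enclosing
    block and keep the block only if it includes the seed block link."""
    (j1, j2), (i1, i2) = link
    extended = set()
    for quad in combinations_with_replacement(E, 4):
        js = [j for (j, i) in quad]
        ts = [i for (j, i) in quad]
        cand = ((min(js), max(js)), (min(ts), max(ts)))
        # 'outward' extension: the candidate must contain the seed link
        if cand[0][0] <= j1 and cand[0][1] >= j2 \
           and cand[1][0] <= i1 and cand[1][1] >= i2:
            extended.add(cand)
    return extended
```

Because the seed's own frontier links belong to $E$, the seed block itself is always among the generated candidates.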
The block link extension algorithm produces block links that contain new source and target intervals $[\hat{\jmath}', \hat{\jmath}]$ and $[\hat{\imath}', \hat{\imath}]$ which extend the interval mapping in Eq. 3. This mapping is no longer a function but rather a relation between source and target intervals, i.e., a single source interval is mapped to several target intervals and vice versa. The extended block set constitutes a subset of the following set of interval pairs:

$$\{\, ([j', j], [i', i]) \mid \mathrm{pr}_T([j', j]) \subseteq [i', i] \,\}$$

The set of high-precision blocks is contained in this set. We cannot use the entire set of blocks defined by all pairs in the above relation: the resulting set of blocks cannot be handled due to memory restrictions, which motivates our extension algorithm. We also tried the following symmetric restriction and tested the resulting block set:

$$\mathrm{pr}_T([j', j]) \subseteq [i', i] \ \text{and}\ \mathrm{pr}_S([i', i]) \subseteq [j', j] \quad (4)$$

The modified restriction is implemented in the context of the extension scheme in Table 1 by inserting an if statement before the alignment link $l$ is extended: the alignment link is extended only if the restriction $\mathrm{pr}_S([i', i]) \subseteq [j', j]$ also holds.
Considering only block links for which the two-way projection in Eq. 4 holds has the following interesting interpretation: assuming a bijection $I$ that is complete, i.e., one in which all source and target positions are covered, an efficient block segmentation algorithm exists to compute a Viterbi block alignment as in Figure 1 for a given training sentence pair. The complexity of the algorithm is quadratic in the length of the source sentence. This dynamic programming technique is not used in the current block selection but might be used in future work.
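For completeness, the symmetric restriction of Eq. 4 amounts to a simple two-way containment test (a sketch under the notation above):

```python
def two_way_projection_holds(link, I):
    """Eq. 4: the target projection of the source interval must lie
    inside the target interval, and vice versa."""
    (j1, j2), (i1, i2) = link
    pr_t = {i for (j, i) in I if j1 <= j <= j2}
    pr_s = {j for (j, i) in I if i1 <= i <= i2}
    return pr_t <= set(range(i1, i2 + 1)) and pr_s <= set(range(j1, j2 + 1))
```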
2.2 Unigram Block Selection
For selecting blocks from the candidate block links, we restrict ourselves to block links where target and source phrases are at most $8$ words long. This way we obtain some tens of millions of blocks on our training data, including blocks that occur only once. This baseline set is further filtered using the unigram count $N(b)$: $N_k$ denotes the set of blocks $b$ for which $N(b) \geq k$. For our Chinese-English experiments, we use the $N_2$ restriction as our baseline, and for the Arabic-English experiments the $N_3$ restriction. Blocks where the target and the source clump are of length $1$ are kept regardless of their count 2. We compute the unigram probability $p(b)$
2 To apply the restrictions exhaustively, we have implemented tree-based data structures to store up to 120 million blocks with phrases of up to length 8 in less than 2 gigabytes of RAM.
Figure 6: An example of $4$ recursively nested blocks $b_1, b_2, b_3, b_4$.
as relative frequency over all selected blocks.
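The $N_k$ selection and the relative-frequency estimate can be sketched as follows (our own illustration; the count table is a plain mapping from blocks, i.e., pairs of word tuples, to their training counts):

```python
from collections import Counter

def select_blocks(counts: Counter, k: int, max_len: int = 8):
    """Keep blocks whose phrases are at most max_len words long and
    that either occur at least k times or are one-to-one word pairs;
    return relative-frequency unigram probabilities p(b)."""
    kept = {
        b: c for b, c in counts.items()
        if len(b[0]) <= max_len and len(b[1]) <= max_len
        and (c >= k or (len(b[0]) == 1 and len(b[1]) == 1))
    }
    total = sum(kept.values())
    return {b: c / total for b, c in kept.items()}
```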
An example of $4$ blocks obtained from the Chinese-English training data is shown in Figure 6. '$DATE' is a placeholder for a date expression. Block $b_4$ contains the blocks $b_1$ to $b_3$. All $4$ blocks are selected in training: the unigram decoder prefers $b_4$ even if $b_1$, $b_2$, and $b_3$ are much more frequent. The solid word links are word links in $I$, the striped word links are word links in $U$. Using the links in $U$, we can learn one-to-many block translations, e.g. the pair ($f_1$, 'Xinhua news agency') is learned from the training data.
3 DP-based Decoder
We use a DP-based beam search procedure similar
to the one presented in (Tillmann and Ney, 2003).
We maximize over all block segmentations $b_1^n$ for which the source phrases yield a segmentation of the input source sentence, generating the target sentence simultaneously. The decoder processes search states of the following form:

$$[\, e', e;\ C,\ j,\ k',\ k \,]$$

$e$ and $e'$ are the two predecessor words used for the trigram language model, $C$ is the so-called coverage vector that keeps track of the already processed source positions, and $j$ is the last processed source position. $k$ is the source phrase length of the block
Table 3: Effect of the extension scheme $I_w U_\delta$ on the CE translation experiments.

Scheme      # blocks $N(b) \geq 1$      # blocks $N(b) \geq 2$      BLEUr4n4
$U_0 U_0$      41.88 M      6.53 M      0.148 ± 0.01
$I_0 U_0$      14.77 M      2.67 M      0.160 ± 0.01
$I_1 U_1$      24.47 M      4.50 M      0.180 ± 0.01
$I_1 U_2$      35.23 M      6.18 M      0.183 ± 0.01
$I_2 U_2$      37.92 M      6.65 M      0.183 ± 0.01
$I_2 U_3$      45.81 M      7.66 M      0.181 ± 0.01
currently being matched. $k'$ is the length of the initial fragment of the source phrase that has been processed so far, and $k'$ is smaller than or equal to $k$: $k' \leq k$. Note
that the partial hypotheses are not distinguished ac-
cording to the identity of the block itself. The de-
coder processes the input sentence ’cardinality syn-
chronously’: all partial hypotheses that are active at
a given point cover the same number of input sen-
tence words. The same beam-search pruning as de-
scribed in (Tillmann and Ney, 2003) is used. The
so-called observation pruning threshold is modified as follows: for each source interval that is being matched by a block source phrase, at most the best 12
target phrases according to the joint unigram proba-
bility are hypothesized. The list of blocks that cor-
respond to a matched source interval is stored in a
chart for each input sentence. This way the match-
ing is carried out only once for all partial hypotheses
that try to match the same input sentence interval.
In the current experiments, decoding without block
re-ordering yields the best translation results. The
decoder translates about 180 words per second.
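The cardinality-synchronous organization of the beam search can be sketched as follows (our own illustration only; state scoring, recombination, and beam pruning are omitted, and `expand` stands for the block-matching step against the per-sentence chart):

```python
from collections import namedtuple

# One search state as described above: the two language model history
# words, the coverage vector C (a bitmask here), the last processed
# source position j, and the lengths k and k' <= k.
State = namedtuple("State", "e2 e1 C j k k_prime")

def cardinality_synchronous_search(initial, expand, n):
    """All hypotheses active at step m cover exactly m source words;
    expand(state) yields (successor, words_covered) pairs."""
    beams = {0: [initial]}
    for m in range(n):
        for state in beams.get(m, []):
            for nxt, covered in expand(state):
                beams.setdefault(m + covered, []).append(nxt)
    return beams.get(n, [])
```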
4 Experimental Results
4.1 Chinese-English Experiments
The translation system is tested on a Chinese-to-
English translation task. For testing, we use the
DARPA/NIST MT 2001 dry-run testing data, which
consists of $610$ sentences with $20\,000$ words arranged in $80$ documents 3. The training data is pro-
vided by the LDC and labeled by NIST as the Large
Data condition for the MT 2002 evaluation. The
3 We removed an initial set of documents that are contained in the training data.
Table 4: Effect of the unigram threshold on the BLEU score. The maximum phrase length is $8$.

Selection Restriction      # blocks selected      BLEUr4n4
N2      6.18 M      0.183 ± 0.01
N3      1.69 M      0.185 ± 0.01
N5      0.85 M      0.178 ± 0.01
N10      0.45 M      0.176 ± 0.01
N25      0.26 M      0.166 ± 0.01
N100      0.18 M      0.154 ± 0.01
Chinese sentences are segmented into words. The
training data contains $20.6$ million Chinese and $23.0$ million English words. The block selection algorithm described below runs in less than one hour on a single 1-Gigahertz Linux machine.
Table 3 presents results for various block extension
schemes. The first column describes the extension
scheme used. The second column reports the total number of blocks collected, in millions, including all the blocks that occurred only once. The third
column reports the number of blocks that occurred
at least twice. These blocks are used to compute the
results in the fourth column: the BLEU score (Pa-
pineni et al., 2002) with a153 reference translation us-
ing a153 -grams along with 95% confidence interval is
reported 4. Line a127 and line a128 of this table show re-
sults where only the source interval projection with-
out any extension is carried out. For the a130a35a131 a119a131 ex-
tension scheme, the high-recall union set itself is
used for projection. The results are worse than for
all other schemes, since a lot of smaller blocks are
discarded due to the projection approach. The a117 a131 a119a131
scheme, where just the a117 word links are used is too
restrictive leaving out bigger blocks that are admis-
sible according to a117 . For the Chinese-English test
data, there is only a minor difference between the
different extension schemes, the best results are ob-
tained for the a117a154a133 a119a133 and the a117a155a133 a119a135 extension schemes.
Table 4 shows the effect of the unigram selection threshold, where the $I_1 U_2$ blocks are used. The second column shows the number of blocks selected. The best results are obtained for the N2 and the N3
4The test data is split into a certain number of subsets. The
BLEU score is computed on each subset. We use the t-test to
compare these scores.
sets. The number of blocks can be reduced drastically while the translation performance declines only gradually.
Table 5 shows the effect of the maximum phrase length on the BLEU score for the N2 block set. Including blocks with longer phrases actually helps to improve performance, although a length of $6$ already obtains nearly identical results.
We carried out the following control experiments (using $N(b) \geq 2$ as threshold): we obtained a block set of $3.61$ million blocks by generating blocks from all quadruples of word links in $I$ 5. This set is a proper superset of the blocks learned for the $I_0 U_0$ experiment in Table 3. The resulting BLEU score is $0.150$: including additional smaller blocks even hurts translation performance in this case. Also, for the extension scheme $I_1 U_2$, we carried out the inverse projection as described in Section 2.1 to obtain a block set of $2.58$ million blocks and a BLEU score of $0.175$. This number is smaller than the BLEU score of $0.183$ for the $I_1 U_2$ restriction: for the translation direction Chinese-to-English, selecting blocks with longer English phrases seems to be important for good translation performance. It is interesting to note that the unigram translation model is symmetric: the translation direction can be switched to English-to-Chinese without re-training the model; just a new Chinese language model is needed. Our experiments, though, show that there is an imbalance with respect to the projection direction that has a significant influence on the translation results. Finally, we carried out an experiment where we used the $I_0 U_0$ block set as a baseline. The extension algorithm was applied only to blocks of target and source length $1$, producing one-to-many translations, e.g. the blocks $b_1$ and $b_2$ in Figure 6. The BLEU score improved to $0.177$ with a block set of $3.10$ million blocks. It seems to be important to carry out the block extension also for larger blocks.
We also ran the N2 system on the June 2002 DARPA
TIDES Large Data evaluation test set. Six re-
search sites and four commercial off-the-shelf sys-
tems were evaluated in the Large Data track. A major-
ity of the systems were phrase-based translation sys-
tems. For comparison with other sites, we quote the
5 We cannot compute the block set resulting from all word link quadruples in $U$, which is much bigger, due to CPU and memory restrictions.
Table 5: Effect of the maximum phrase length on the BLEU score. Both target and source phrase are no longer than the maximum. The unigram threshold is $N(b) \geq 2$.

maximum phrase length      # blocks selected      BLEUr4n4
8      6.18 M      0.183 ± 0.01
7      5.60 M      0.182 ± 0.01
6      4.97 M      0.182 ± 0.01
5      4.25 M      0.179 ± 0.01
4      3.40 M      0.178 ± 0.01
3      2.34 M      0.167 ± 0.01
2      1.07 M      0.150 ± 0.01
1      0.16 M      0.118 ± 0.01
Table 6: Effect of the extension scheme $I_w U_\delta$ on the AE translation experiments.

Scheme      # blocks $N(b) \geq 1$      # blocks $N(b) \geq 3$      BLEUr3n4
$I_0 U_0$      79.0 M      6.79 M      0.209 ± 0.03
$I_1 U_1$      96.6 M      8.29 M      0.223 ± 0.03
$I_1 U_2$      113.16 M      9.87 M      0.232 ± 0.03
NIST score (Doddington, 2002) on this test set: the N2 system scores 7.56, whereas the official top two systems scored 7.65 and 7.34, respectively.
4.2 Arabic-English Experiments
We also carried out experiments for the translation
direction Arabic to English using training data from
UN documents. For testing, we use a test set of $100$ sentences with $5\,827$ words arranged in $16$ documents. The training data contains $122.0$ million Arabic and $95.8$ million English words. The training data is pre-processed using some morphological analysis. For the Arabic experiments, we have tested the $3$ extension schemes $I_0 U_0$, $I_1 U_1$, and $I_1 U_2$, as shown in Table 6. Here, the results for the different schemes differ significantly, and the $I_1 U_2$ scheme produces the best results. For the AE experiments, only blocks up to a phrase length of $7$ are computed
due to disk memory restrictions. The training data
is split into several chunks of $300\,000$ training sentence pairs each, and the final block set together with the unigram counts is obtained by merging the block files written onto disk for each of the chunks. The word-to-word alignment is trained using $5$ iterations of IBM Model 1 training followed by $5$ iterations of HMM Viterbi training. This
training procedure takes about a day to execute on a
single machine. Additionally, the overall block se-
lection procedure takes about 2.5 hours to execute.
5 Previous Work
Block-based translation units are used in several pa-
pers on statistical machine translation. (Och et al.,
1999) describe the alignment template system for
statistical MT: alignment templates correspond to
blocks that do have an internal structure. Marcu
and Wong (2002) use a joint probability model for
blocks where the clumps are contiguous phrases as
in this paper. Yamada and Knight (2002) present a decoder for syntax-based MT that uses so-called phrasal translation units that correspond to blocks. Block unigram counts are used to filter the blocks. The phrasal model is integrated into a syntax-based model. Projection of phrases has also been used in
(Yarowsky et al., 2001). A word link extension al-
gorithm similar to the one presented in this paper is
given in (Koehn et al., 2003).
6 Conclusion
In this paper, we describe a block-based unigram
model for SMT. A novel block learning algorithm is
presented that extends high-precision interval pro-
jections by elements from a high-recall alignment.
The extension method is shown to improve transla-
tion performance significantly. For the Chinese-to-
English task, we obtained a NIST score of 7.56 on
the June 2002 DARPA TIDES Large Data evalua-
tion test set.
Acknowledgements
This work was partially supported by DARPA and
monitored by SPAWAR under contract No. N66001-
99-2-8916. The paper has greatly profited from dis-
cussion with Fei Xia and Kishore Papineni.

References
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della
Pietra, and Robert L. Mercer. 1993. The Mathematics
of Statistical Machine Translation: Parameter Estima-
tion. Computational Linguistics, 19(2):263–311.
George Doddington. 2002. Automatic evaluation of ma-
chine translation quality using n-gram co-occurrence
statistics. In Proc. of the Second International Confer-
ence of Human Language Technology Research, pages
138–145, March.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical Phrase-Based Translation. In Proc.
of the HLT-NAACL 2003 conference, pages 127–133,
Edmonton, Alberta, Canada, May.
Daniel Marcu and William Wong. 2002. A Phrase-
Based, Joint Probability Model for Statistical Machine
Translation. In Proc. of the Conf. on Empirical Meth-
ods in Natural Language Processing (EMNLP 02),
pages 133–139, Philadelphia, PA, July.
Franz-Josef Och and Hermann Ney. 2000. Improved Sta-
tistical Alignment Models. In Proc. of the 38th Annual
Meeting of the Association of Computational Linguis-
tics (ACL 2000), pages 440–447, Hong-Kong, China,
October.
Franz-Josef Och, Christoph Tillmann, and Hermann Ney.
1999. Improved Alignment Models for Statistical Ma-
chine Translation. In Proc. of the Joint Conf. on Em-
pirical Methods in Natural Language Processing and
Very Large Corpora (EMNLP/VLC 99), pages 20–28,
College Park, MD, June.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. BLEU: a Method for Automatic
Evaluation of machine translation. In Proc. of the
40th Annual Conf. of the Association for Computa-
tional Linguistics (ACL 02), pages 311–318, Philadel-
phia, PA, July.
Christoph Tillmann and Hermann Ney. 2003. Word Re-
ordering and a DP Beam Search Algorithm for Statis-
tical Machine Translation. Computational Linguistics,
29(1):97–133.
Stefan Vogel, Hermann Ney, and Christoph Tillmann.
1996. HMM Based Word Alignment in Statistical Ma-
chine Translation. In Proc. of the 16th Int. Conf.
on Computational Linguistics (COLING 1996), pages
836–841, Copenhagen, Denmark, August.
Kenji Yamada and Kevin Knight. 2002. A Decoder for
Syntax-based Statistical MT. In Proc. of the 40th An-
nual Conf. of the Association for Computational Lin-
guistics (ACL 02), pages 303–310, Philadelphia, PA,
July.
David Yarowsky, Grace Ngai, and Richard Wicentowski.
2001. Inducing Multilingual Text Analysis tools via
Robust Projection across Aligned Corpora. In Proc. of
the HLT 2001 conference, pages 161–168, San Diego,
CA, March.
