Efficient Dynamic Programming Search Algorithms for Phrase-Based SMT
Christoph Tillmann
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
ctill@us.ibm.com
Abstract
This paper presents a series of efficient dynamic-programming (DP) based algorithms for phrase-based decoding and alignment computation in statistical machine translation (SMT). The DP-based decoding algorithms are analyzed in terms of shortest path-finding algorithms, where the similarity to DP-based decoding algorithms in speech recognition is demonstrated. The paper contains the following original contributions: 1) the DP-based decoding algorithm in (Tillmann and Ney, 2003) is extended in a formal way to handle phrases, and a novel pruning strategy with increased translation speed is presented; 2) a novel alignment algorithm is presented that computes a phrase alignment efficiently in the case that it is consistent with an underlying word alignment. Under certain restrictions, both algorithms efficiently handle MT-related problems that are generally NP-complete (Knight, 1999).
1 Introduction
This paper deals with dynamic programming based decoding and alignment algorithms for phrase-based SMT. Dynamic programming based search algorithms are being used in speech recognition (Jelinek, 1998; Ney et al., 1992) as well as in statistical machine translation (Tillmann et al., 1997; Niessen et al., 1998; Tillmann and Ney, 2003). Here, the decoding algorithms are described as shortest path finding algorithms in regularly structured search graphs or search grids. Under certain restrictions, e.g. start and end point restrictions for the path, the shortest path computed corresponds to a recognized word sequence or a generated target language
translation.

Figure 1: Illustration of a DP-based algorithm to solve a traveling salesman problem with 5 cities. The visited cities correspond to processed source positions. The figure shows the lattice of vertexes $(C, j)$, i.e. a set $C$ of visited cities together with the last visited city $j$, leading from $(\{1\}, 1)$ through all intermediate coverage sets to a final vertex.

In these algorithms, a shortest-path search is carried out in one pass over some input along a specific 'direction': in speech recognition the search is time-synchronous; the single-word based search algorithm in (Tillmann et al., 1997) is (source) position-synchronous or left-to-right; the search algorithm in (Niessen et al., 1998) is (target) position-synchronous or bottom-to-top; and the search algorithm in (Tillmann and Ney, 2003) is so-called cardinality-synchronous.
Taking into account the different word order between source and target language sentences, it becomes less obvious that an SMT search algorithm can be described as a shortest path finding algorithm. But this has been shown by linking decoding to a dynamic-programming solution for the traveling salesman problem. This algorithm, due to (Held and Karp, 1962), is a special case of a shortest path finding algorithm (Dreyfus and Law, 1977). The regularly structured search graph for this problem is illustrated in Fig. 1: all paths from the left-most to the right-most vertex correspond to a translation of the input sentence, where each source position is processed exactly once. In this paper, the DP-based search algorithm in (Tillmann and Ney, 2003) is extended in a formal way to handle phrase-based translation. Two versions of a phrase-based decoder for SMT that search slightly different search graphs are presented: a multi-beam decoder reported in the literature and a single-beam decoder with increased translation speed.¹ A common analysis of all the search algorithms above in terms of a shortest-path finding algorithm for a directed acyclic graph (dag) is presented. This analysis provides a simple way of analyzing the complexity of DP-based search algorithms.
Generally, the regular search space can only be fully searched for small search grids under appropriate restrictions, e.g. the monotonicity restrictions in (Tillmann et al., 1997) or the inverted search graph in (Niessen et al., 1998). For larger search spaces, as are required for continuous speech recognition (Ney et al., 1992)² or phrase-based decoding in SMT, the search space cannot be fully searched: suitably defined lists of path hypotheses are maintained that partially explore the search space. The number of hypotheses kept depends locally on the number of hypotheses whose score is close to the top-scoring hypothesis; this set of hypotheses is called the beam.
The translation model used in this paper is a phrase-based model, where the translation units are so-called blocks: a block is a pair of phrases that are translations of each other. For example, Fig. 2 shows an Arabic-English translation example that uses 5 blocks. During decoding, we view translation as a block segmentation process, where the input sentence is segmented from left to right and the target sentence is generated from bottom to top, one block at a time. In practice, a largely monotone block sequence is generated, except for the possibility of swapping some neighbor blocks. During decoding, we try to minimize the score $s_w(b_1^n)$ of a block sequence $b_1^n$ under the restriction that the concatenated source phrases of the blocks $b_i$ yield a segmentation of the input sentence:
$s_w(b_1^n) = \sum_{i=1}^{n} c(b_{i-1}, b_i) = \sum_{i=1}^{n} w^T \cdot f(b_{i-1}, b_i)$.  (1)
Here, $f(b_{i-1}, b_i)$ is an $N$-dimensional feature vector with real-valued features, and $w$ is the corresponding weight vector, as described in Section 5. The fact that a given block covers some source interval $[j', j]$ is implicit in this notation.
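To make Eq. 1 concrete, here is a minimal Python sketch of the block-sequence cost; the Block type, the toy features() function, and the weights are invented stand-ins for the Section 5 models, not code from the paper.

    # Sketch of the Eq. 1 cost: a block sequence costs the sum of the
    # weighted block-bigram feature vectors w^T . f(b_{i-1}, b_i).
    # Block, features(), and weights are invented stand-ins.
    from typing import List, Tuple

    Block = Tuple[str, str]             # (source phrase, target phrase)

    def features(prev: Block, cur: Block) -> List[float]:
        # Stand-in for the N real-valued features of Section 5
        # (negative log-probabilities, word penalty, ...).
        return [float(len(cur[1].split())), 1.0]

    def sequence_cost(blocks: List[Block], weights: List[float]) -> float:
        cost, prev = 0.0, ("<s>", "<s>")    # dummy predecessor block b_0
        for cur in blocks:
            f = features(prev, cur)
            cost += sum(w * x for w, x in zip(weights, f))
            prev = cur
        return cost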
¹The multi-beam decoder is similar to the decoder presented in (Koehn, 2004), which is a standard decoder used in phrase-based SMT. A multi-beam decoder is also used in (Al-Onaizan et al., 2004) and (Berger et al., 1996).

²In that work, there is a distinction between within-word and between-word search, which is not relevant for phrase-based decoding, where only exact phrase matches are searched.
Figure 2: An Arabic-English block translation example, where the Arabic words are romanized. A sequence of 5 blocks is generated.
This paper is structured as follows: Section 2 introduces the multi-beam and the single-beam DP-based decoders. Section 3 presents an analysis of all the graph-based shortest-path finding algorithms mentioned above in terms of a search algorithm for a directed acyclic graph (dag). Section 4 shows an efficient phrasal alignment algorithm that gives an algorithmic justification for learning blocks from word-aligned training data. Finally, Section 5 presents an evaluation of the beam-search decoders on an Arabic-English decoding task.
2 Beam-Search Decoding Algorithms
In this section, we introduce two beam-search algorithms for SMT: a multi-beam algorithm and a single-beam algorithm. The multi-beam search algorithm is presented first, since it is conceptually simpler.
2.1 Multi-Beam Decoder
The multi-beam decoder makes use of search states that are 3-tuples of the following type:

$[C, h; d]$.  (2)

$h$ is the state history, which depends on the block generation model. In our case, $h = ([j', j], [u, v])$, where $[j', j]$ is the interval where the most recent block matched the input sentence, and $[u, v]$ are the final two target words of the partial translation produced thus far. $C$ is the so-called coverage vector, which ensures that a consistent block alignment is obtained during decoding and that the decoding can be carried out efficiently: it keeps track of the already processed input sentence positions. $d$ is the cost of the shortest path (distance) from some initial state $s_0$ to the current state $s$. The baseline decoder maintains $J + 1$ state lists with entries of the above type, where $J$ is the number of input words. The states are stored in lists or stacks that support lookup operations to check whether a given state tuple is already present in a list and what its score $d$ is.

Table 1: Multi-beam (M-Beam) decoding algorithm, which is similar to (Koehn, 2004). The decoders differ in their pruning strategy: here, each state list $S_c$ is pruned only once, whereas the decoder in (Koehn, 2004) prunes a state list every time a new hypothesis is entered.

    input: source sentence with words $f_1, \ldots, f_J$
    $S_0 := \{s_0\}$ and $S_k := \emptyset$ for $k = 1, \ldots, J$
    for each $c = 0, 1, \ldots, J$ do
        Prune state set $S_c$
        for each state $s$ in $S_c$ do
            matcher: for each $s' : s \rightarrow s'$
                update $s'$ for $S_{c + l(s')}$
        end
    end
    output: translation from lowest cost state in $S_J$
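As a concrete illustration of the Eq. 2 state tuple, here is a minimal sketch in Python (the field names are assumptions of this sketch, not the paper's implementation); the path cost $d$ is deliberately kept outside the state key, so that two identical tuples can recombine to the lower cost:

    # Sketch of the Eq. 2 search state [C, h; d]; illustrative only.
    from dataclasses import dataclass
    from typing import FrozenSet, Tuple

    @dataclass(frozen=True)
    class State:
        coverage: FrozenSet[int]        # C: set of covered source positions
        last_match: Tuple[int, int]     # [j', j]: interval of the last block
        last_words: Tuple[str, str]     # (u, v): final two target words

    # The path cost d is stored in a dict keyed by State, so that equal
    # states found on different paths recombine to the cheaper cost d.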
The use of a coverage vector $C$ is related to a DP-based solution for the traveling salesman problem, as illustrated in Fig. 1. The algorithm keeps track of sets of visited cities along with the identity of the last visited city. Cities correspond to source sentence positions $j$. The vertexes in this graph correspond to sets of already visited cities. Since the traveling salesman problem (and also the translation model) uses only local costs, the order in which the source positions have been processed can be ignored. Conceptually, the re-ordering problem is linearized by searching a path through the set-inclusion graph in Fig. 1. Phrase-based decoding is handled by an almost identical algorithm: the last visited position $j$ is replaced by an interval $[j', j]$.
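The set-inclusion search of Fig. 1 can be sketched as a small Held-Karp-style DP (illustrative Python; local_cost is an assumed stand-in for the model's local costs):

    # Minimal Held-Karp-style DP over (coverage set, last position)
    # states, mirroring Fig. 1; coverage sets are bitmasks over the
    # n source positions, and local_cost(last, nxt) is assumed.
    def held_karp(n, local_cost):
        INF = float("inf")
        d = {(1 << 0, 0): 0.0}          # start: position 1 (index 0) visited
        for mask in range(1, 1 << n):
            for last in range(n):
                if (mask, last) not in d:
                    continue
                for nxt in range(n):    # extend to an unvisited position
                    if mask & (1 << nxt):
                        continue
                    new = (mask | (1 << nxt), nxt)
                    cost = d[(mask, last)] + local_cost(last, nxt)
                    if cost < d.get(new, INF):
                        d[new] = cost   # relaxation / recombination
        full = (1 << n) - 1
        return min(d[(full, j)] for j in range(n) if (full, j) in d)

For instance, held_karp(5, lambda i, j: 1.0) traverses exactly the vertex grid of Fig. 1, one vertex per (coverage set, last position) pair, using 0-based positions.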
Extending the partial block translation represented by a state $s$ with a single block $b'$ generates a new state $s'$. Here, $[k, k']$ is the source interval where block $b'$ matches the input sentence. The state transition is defined as follows:

$[C, h; d] \rightarrow [C', h'; d']$.  (3)
The $s'$ state fields are updated on a component-by-component basis. $C' = C \cup [k, k']$ is the coverage vector obtained by adding all the positions from the interval $[k, k']$. The new state history is defined as $h' = ([k, k'], [u', v'])$, where $u'$ and $v'$ are the final two target words of the target phrase $e'$ of $b'$. Some special cases, e.g. where $e'$ has less than two target words, are taken into account. The path cost $d'$ is computed as $d' = d + d(s, s')$, where the transition cost $d(s, s') := c(b, b')$ is computed from the history $h$ and the matching block $b'$, as defined in Section 5.

Table 2: Single-beam (S-Beam) decoding algorithm (related to (Lowerre and Reddy, 1980)).

    input: source sentence with words $f_1, \ldots, f_J$
    $S := \{s_0\}$
    for each $c = 0, 1, \ldots, J$ do
        $S' := \emptyset$
        for each state $s$ in $S$ do
            if CLOSED?($s$) then
                matcher: for each $s' : s \rightarrow s'$
            else
                scanner: for single $s' : s \Rightarrow s'$
            update $s'$ for $S'$
        end
        Prune state set $S'$
        Swap $S$, $S'$
    end
    output: translation from lowest cost state in $S$
The decoder in Table 1 fills $J + 1$ state sets $S_k$ for $k = 0, \ldots, J$. All the coverage vectors $C$ for states in the set $S_k$ cover the same number of source positions $k$. When a state set $S_k$ is processed, the decoder has finished processing all states in the sets $S_l$ with $l < k$. Before expanding a state set, the decoder prunes it based on its coverage vectors and path costs only; two different pruning strategies are used, both introduced in (Tillmann and Ney, 2003): 1) coverage pruning prunes states that share the same coverage vector $C$, and 2) cardinality pruning prunes states according to the cardinality $c(C)$ of covered positions, where all states in the beam are compared with each other. Since the states are kept in $J + 1$ separate lists, which are pruned independently of each other, this decoder version is called the multi-beam decoder. The decoder uses a matcher function when expanding a state: for a state $s$ it looks for uncovered source positions to find source phrase matches for blocks. Updating a state in Table 1 includes adding the state if it is not yet present, or updating its shortest path cost $d$: if the state is already in $S_c$, only the state with the lower path cost is kept. This inserting/updating operation is also called recombination or relaxation in the context of a dag search algorithm (cf. Section 3). The update procedure also stores for each state $s'$ its predecessor state in a so-called back-pointer array (Ney et al., 1992). The final block alignment and target translation can be recovered from this back-pointer array once the final state set $S_J$ has been computed. $l(s')$ is the source phrase length of the matching block $b'$ when going from $s$ to $s'$. This algorithm is similar to the beam-search algorithm presented in (Koehn, 2004): it allows states to be added to a stack that is not the stack for the successor cardinality. $s_0$ is the initial decoder state, where no source position is covered: $C = \emptyset$. For the final states in $S_J$, all source positions are covered.
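A compact Python skeleton of the Table 1 loop may help; it is an illustrative sketch, not the paper's implementation, and match_blocks and prune are assumed callbacks supplied by the translation model.

    # Skeleton of the M-Beam decoder of Table 1 (illustrative only).
    # A state is (coverage, history); match_blocks(state) is an assumed
    # callback yielding (new_coverage, new_history, cost_delta, l) for
    # every block matching an uncovered interval of length l.
    def multi_beam_decode(J, match_blocks, prune):
        S = [dict() for _ in range(J + 1)]   # S[c]: state -> best cost d
        back = {}                            # back-pointer array
        S[0][(frozenset(), None)] = 0.0
        for c in range(J + 1):
            prune(S[c])                      # each list pruned only once
            for state, d in list(S[c].items()):
                for cov2, hist2, delta, l in match_blocks(state):
                    s2, d2 = (cov2, hist2), d + delta
                    target = S[c + l]        # stage c + l(s')
                    if d2 < target.get(s2, float("inf")):
                        target[s2] = d2      # recombination / relaxation
                        back[s2] = state     # store predecessor
        return min(S[J].items(), key=lambda kv: kv[1]), back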
2.2 Single-Beam Implementation
The second implementation uses two lists to keep a single beam of active states. This corresponds to a beam-search decoder in speech recognition, where path hypotheses corresponding to word sequences are processed in a time-synchronous way, and at a given time step only hypotheses within some percentage of the best hypothesis are kept (Lowerre and Reddy, 1980). The single-beam decoder processes hypotheses cardinality-synchronously, i.e. the states at stage $k$ generate new states at stage $k + 1$. In order to make the use of a single beam possible, we slightly modify the state transitions in Eq. 3:
we slightly modify the state transitions in Eq. 3:
a37a79
a23
a155
a23a82a81a108a83a86a85
a40a162a120a136a145 a37a79 a39
a23
a155a39
a23a82a81a84a83a86a85 a39
a40
a23 (4)
a37a79
a23
a30
a23a97a81a134a83a86a85
a40a163a120a122a121 a37a79 a39
a23
a155a39 a15a164a111
a23a150a81 a39 a83a118a85 a39
a40a77a35 (5)
Here, Eq. 5 corresponds to the matcher definition in Eq. 3. We add an additional field $l$, a pointer that keeps track of how much of the most recent source phrase match has been covered. In Eq. 5, when a block is matched to the input sentence, this pointer is set to the position $k$ where the most recent block match starts. We use a dot $\cdot$ to indicate that when a block is matched, the matching position of the predecessor state can be ignored. While the pointer $l$ is not yet equal to the end position $k'$ of the match, it is increased, $l' := l + 1$, as shown in Eq. 4. The path cost is set as $d' = d + \delta$, where $\delta$ is the state transition cost $d(s, s')$ divided by the source phrase length of block $b'$: we evenly spread the cost of generating $b'$ over all source positions being matched. The new coverage vector $C'$ is obtained from $C$ by adding the scanned position $l'$: $C' = C \cup \{l'\}$. The algorithm that makes use of the above definitions is shown in Table 2. The states are stored in only two state sets $S$ and $S'$: $S$ contains the most probable hypotheses that were kept in the last beam pruning step, all of which cover $k$ source positions, and $S'$ contains all the hypotheses in the current beam that cover $k + 1$ source positions. The single-beam decoder in Table 2 uses two procedures: the scanner and the matcher correspond to the state transitions in Eq. 4 and Eq. 5, respectively. Here, the matcher simply matches a block to an uncovered portion of the input sentence. After the matcher has matched a block, that block is processed in a cardinality-synchronous way using the scanner procedure as described above. The predicate CLOSED?($s$) is used to switch between matching and scanning states: it is true if the pointer $l$ is equal to the match end position $k'$ (which is stored in $h'$). At this point, the position-by-position match of the source phrase is completed, and we can search for additional block matches.
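The two transition types of Eqs. 4 and 5 can be sketched as follows (hypothetical Python; the dict-based state layout and the delta field are assumptions of this sketch):

    # Illustrative sketch of the S-Beam transitions (Eqs. 4 and 5).
    # A state is a dict with coverage C, pointer l, match end k_end,
    # history h, path cost d, and per-position cost share delta
    # (the block's transition cost divided by its source length).
    def match(state, k, k_end, new_history, block_cost):
        delta = block_cost / (k_end - k + 1)  # spread cost over the match
        return {"C": state["C"] | {k}, "l": k, "k_end": k_end,
                "h": new_history, "d": state["d"] + delta, "delta": delta}

    def closed(state):
        return state["l"] == state["k_end"]   # the CLOSED?(s) predicate

    def scan(state):
        assert not closed(state)              # only open states scan
        l2 = state["l"] + 1                   # cover the next position
        return {**state, "C": state["C"] | {l2}, "l": l2,
                "d": state["d"] + state["delta"]}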
3 DP Shortest Path Algorithm for dag
This section analyzes the relationship between the block decoding algorithms in this paper and a single-source shortest-path algorithm for a directed acyclic graph (dag). We closely follow the presentation in (Cormen et al., 2001) and only sketch the algorithm here: a dag $G = (V, E)$ is a weighted graph for which a topological sort of its vertex set $V$ exists: all the vertexes can be enumerated in linear order. For such a weighted graph, the shortest path from a single source can be computed in $O(|V| + |E|)$ time, where $|V|$ is the number of vertexes and $|E|$ is the number of edges in the graph. The dag search algorithm runs over all vertexes $s$ in topological order. Assuming an adjacency-list representation of the dag, for each vertex $s$ we loop over all successor vertexes $s'$, where each vertex $s$ with its adjacency list is processed exactly once. During the search, we maintain for each vertex $s'$ an attribute $d[s']$, which is an upper bound on the shortest path cost from the source vertex to the vertex $s'$. This shortest-path estimate is updated or relaxed each time the vertex $s'$ occurs in some adjacency list. Ignoring the pruning, the M-Beam decoding algorithm in Table 1 and the dag search algorithm can be compared as follows: states correspond to dag vertexes and state transitions correspond to dag edges. Using two loops for the multi-beam decoder while generating states in stages is just a way of generating a topological sort of the search states on the fly: a linear order of search states is generated by appending the search states in the state lists $S_0$, $S_1$, etc.
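For reference, the dag search the section follows (Cormen et al., 2001) fits in a few lines; the adjacency-list format used here is an assumption of the sketch:

    # Single-source shortest path in a dag, following the CLRS outline:
    # scan vertexes in topological order and relax each edge once.
    # 'topo' is a topologically sorted vertex list; 'adj' maps a vertex
    # to (successor, edge_cost) pairs (format assumed for this sketch).
    def dag_shortest_path(topo, adj, source):
        d = {v: float("inf") for v in topo}
        d[source] = 0.0
        pred = {}
        for v in topo:                 # each adjacency list visited once
            for v2, cost in adj.get(v, ()):
                if d[v] + cost < d[v2]:
                    d[v2] = d[v] + cost    # relaxation
                    pred[v2] = v           # back-pointer
        return d, pred                 # runs in O(|V| + |E|)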
The analysis in terms of a dag shortest-path algorithm can be used for a simple complexity analysis of the proposed algorithms. Local state transitions correspond to an adjacency-list traversal in the dag search algorithm. These involve costly lookup operations, e.g. language, distortion, and translation model probability lookups. Typically, the computation time for update operations on the lists $S$ is negligible compared to these probability lookups. So, the search algorithm complexity is simply computed as the number of edges in the search graph: $O(|V| + |E|) \approx O(|E|)$ (this analysis is implicit in (Tillmann, 2001)). Without proof, for the search algorithm in Section 2.1 we observe that the number of states is finite and that all the states are actually reachable from the start state $s_0$. This way, for the single-word based search in (Tillmann and Ney, 2003), a complexity of $O(|V_T|^3 \cdot J^2 \cdot 2^J)$ is shown, where $|V_T|$ is the size of the target vocabulary and $J$ is the length of the input sentence. The complexity is dominated by the exponential number of coverage vectors $C$ that occur in the search, and the complexity of phrase-based decoding is higher yet, since its hypotheses store a source interval $[j', j]$ rather than a single source position $j$. In the general case, no efficient search algorithm exists to search all word or phrase re-orderings (Knight, 1999). Efficient search algorithms can be derived by restricting the allowable coverage vectors (Tillmann, 2001) to local word re-ordering only. An efficient phrase alignment method that does not make use of re-ordering restrictions is demonstrated in the following section.
4 Efficient Block Alignment Algorithm
A common approach to phrase-based SMT is to learn phrasal translation pairs from word-aligned training data (Och and Ney, 2004). Here, a word alignment $A$ is a subset of the Cartesian product of target and source positions:

$A \subseteq \{1, \ldots, I\} \times \{1, \ldots, J\}$.

Here, $I$ is the target sentence length and $J$ is the source sentence length. The phrase learning approach in (Och and Ney, 2004) takes two alignments: a source-to-target alignment $A_1$ and a target-to-source alignment $A_2$. The intersection of these two alignments is computed to obtain a high-precision word alignment. Here, we note that if the intersection covers all source and target positions (as shown in Fig. 4), it constitutes a bijection between source and target sentence positions, since the intersected alignments are functions according to their definition in (Brown et al., 1993).³ In this paper, an algorithmic justification for restricting blocks based on word alignments is given. We assume that source and target sentence are given, and the task is to compute the lowest scoring block alignment. Such an algorithm might be important in some discriminative training procedure that relies on decoding the training data efficiently.
To restrict the block selection based on word-aligned training data, interval projection functions are defined as follows:⁴ $S$ is a source interval and $E$ is a target interval. $\text{proj}_T(S)$ is the set of target positions $i$ such that the alignment point $(i, j)$ occurs in the alignment set $A$ and $j$ is covered by the source interval $S$; $\text{proj}_S(E)$ is defined accordingly. Formally, the definitions look like this:

$\text{proj}_T(S) = \{\, i \mid (i, j) \in A \text{ and } j \in S \,\}$
$\text{proj}_S(E) = \{\, j \mid (i, j) \in A \text{ and } i \in E \,\}$

³(Tillmann, 2003) reports intersection coverage figures for Arabic-English and for Chinese-English parallel data. In the case of incomplete coverage, the current algorithm can be extended as described in Section 4.1.

⁴(Och and Ney, 2004) defines the notion of consistency for the set of phrasal translations that are learned from word-aligned training data, which is equivalent.

Figure 3: Following the definition in Eq. 6, the left picture shows three admissible block links, while the right picture shows three non-admissible block links.
In order to obtain a particularly simple block alignment algorithm, the allowed block links $(S, E)$ are restricted by an ADMISSIBILITY restriction, which is defined as follows:

$(E, S)$ is admissible iff $\text{proj}_S(E) \subseteq S$ and $\text{proj}_T(S) \subseteq E$.  (6)

Admissibility is related to the word re-ordering problem: for the source positions in an interval $S$ and for the target positions in an interval $E$, all word re-ordering involving these positions has to take place within the block defined by $S$ and $E$. Without an underlying alignment $A$, each pair of source and target intervals would define a possible block link; the admissibility restriction reduces the number of block links drastically. Examples of admissible and non-admissible blocks are shown in Fig. 3.
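The projection functions and the Eq. 6 test translate almost literally into code; a minimal sketch, assuming 1-based positions and inclusive (start, end) intervals:

    # proj_T, proj_S, and the Eq. 6 admissibility check. The alignment A
    # is a set of (i, j) pairs (target position i, source position j);
    # intervals are inclusive (start, end) pairs, 1-based by assumption.
    def proj_T(A, S):
        lo, hi = S
        return {i for (i, j) in A if lo <= j <= hi}

    def proj_S(A, E):
        lo, hi = E
        return {j for (i, j) in A if lo <= i <= hi}

    def admissible(A, E, S):
        in_S = set(range(S[0], S[1] + 1))
        in_E = set(range(E[0], E[1] + 1))
        return proj_S(A, E) <= in_S and proj_T(A, S) <= in_E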
If the alignment $A$ is a bijection, by definition each target position $i$ is aligned to exactly one source position $j$ and vice versa, and source and target sentence have the same length. Because of the admissibility definition, a target interval clumping alone is sufficient to determine the source interval clumping and the clump alignment. In Fig. 4, a bijection word alignment for a sentence pair that consists of $J = 4$ source and $I = 4$ target words is shown, where the alignment links that yield a bijection are shown as solid dots. Four admissible block alignments are shown as well. An admissible block alignment is always guaranteed to exist: the block that covers all source and target positions is admissible by definition. The underlying word alignment and the admissibility restriction play together to reduce the number of block alignments: out of all eight possible target clumpings, only five yield segmentations with admissible block links.

Figure 4: Four admissible block alignments in case the word alignment intersection is a bijection. The block alignment which covers the whole sentence pair with a single block is not shown.

Table 3: Efficient DP-based block alignment algorithm using an underlying word alignment $A$. For simplicity reasons, the block score $c(b')$ is computed based on the block identity $b'$ only.

    input: parallel sentence pair and alignment $A$
    initialization: $Q(0) = 0$; $Q(i) = \infty$ and $q(i, i') = \infty$ for $i, i' = 1, \ldots, I$
    for each $i = 1, 2, \ldots, I$ do
        $Q(i) = \min_{i'} \,[\, q(i, i') + Q(i') \,]$, where
        $q(i, i') = c(b')$ if block $b'$ results from an admissible block link $(E, S)$ with $E = [i' + 1, i]$
    traceback:
        - find best end hypothesis: $Q(I)$
The DP-based algorithm to compute the block sequence with the best (lowest-cost) score is shown in Table 3. Here, the following auxiliary quantity is used:

$Q(i)$ := score of the best partial segmentation that covers the target interval $[1, i]$.

Target intervals are processed from bottom to top. A target interval $E = [i', i]$ is projected using the word alignment $A$, where a given target interval might not yield an admissible block. For the initialization, we set $Q(i) = \infty$, and the final score is obtained as $Q_{\text{final}} = Q(I)$. The complexity of the algorithm is $O(I^2)$, where the time to compute the cost $c(b')$ and the time to compute the interval projections are ignored. Using the alignment links $A$, the segmentation problem is essentially linearized: the target clumping is generated sequentially from bottom to top, and it induces some source clumping in an order which is defined by the word alignment.

Figure 5: An example of a block alignment involving a non-aligned column. The right-most alignment is not allowed by the closure restriction.
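A direct rendering of the Table 3 recurrence (a sketch under the bijection assumption; source_interval and block_cost are assumed helpers that perform the projection and scoring):

    # DP of Table 3: Q(i) = min over i' of q(i, i') + Q(i'), where
    # q(i, i') is the cost of the block spanning target interval
    # E = [i'+1, i], if that interval yields an admissible block link.
    # 'source_interval(E)' is an assumed helper returning the projected
    # source interval S (or None), and 'block_cost(E, S)' its cost.
    def align_blocks(I, source_interval, block_cost):
        INF = float("inf")
        Q = [0.0] + [INF] * I
        bp = [None] * (I + 1)                 # traceback pointers
        for i in range(1, I + 1):
            for i2 in range(0, i):            # block covers E = [i2+1, i]
                E = (i2 + 1, i)
                S = source_interval(E)        # None if not admissible
                if S is None or Q[i2] == INF:
                    continue
                cand = Q[i2] + block_cost(E, S)
                if cand < Q[i]:
                    Q[i], bp[i] = cand, i2
        return Q[I], bp                       # final score Q(I)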
4.1 Incomplete Bijection Coverage
In this section, an algorithm is sketched that works if the intersection coverage is not complete. In this case, a given target interval may produce several admissible block links, since it can be coupled with different source intervals to form admissible block links; e.g. in Fig. 5, the target interval $[0, 1]$ is linked to two source intervals, and neither of the resulting block links violates the admissibility restriction. The minimum score block translation can be computed using either the single-beam or the multi-beam algorithm presented earlier. The search state definition in Eq. 2 is modified to keep track of the current target position $i$, the same way as the recursive quantity $Q(i)$ does in the algorithm in Table 3:

$[C, h, i; d]$.  (7)

Additionally, a complex block history $h$ as defined in Section 2 can be used. Before the search is carried out, the set of admissible block links for each target interval is precomputed and stored in a table, where a simple lookup for each target interval $[i', i]$ is carried out during alignment. The efficiency of the block alignment algorithm depends on the alignment intersection coverage.
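The precomputed lookup table described above can be sketched as follows; the brute-force enumeration over all source intervals and the admissible argument (the Eq. 6 test from the earlier sketch) are simplifying assumptions:

    # Precompute, for every target interval E = [i'+1, i], the source
    # intervals S that form an admissible block link (E, S). Brute-force
    # enumeration over all source intervals is used for clarity only;
    # admissible(A, E, S) is the Eq. 6 test sketched earlier.
    def admissible_links(A, I, J, admissible):
        table = {}
        for i2 in range(0, I):
            for i in range(i2 + 1, I + 1):
                E = (i2 + 1, i)
                table[E] = [(j1, j2)
                            for j1 in range(1, J + 1)
                            for j2 in range(j1, J + 1)
                            if admissible(A, E, (j1, j2))]
        return table                    # looked up during alignment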
5 Beam-Search Results
In this section, we present results for the beam-search algorithms introduced in Section 2. The MT03 Arabic-English NIST evaluation test set, consisting of 663 sentences with 16,278 Arabic words, is used for the experiments. Translation results in terms of uncased BLEU using 4 reference translations are reported in Table 4 and Table 5 for the single-beam (S-Beam) and the multi-beam (M-Beam) search algorithms. For all re-ordering experiments, the notion of skips is used (Tillmann and Ney, 2003) to restrict the phrase re-ordering: the number of skips restricts the number of holes in the coverage vector for a left-to-right traversal of the input sentence. All re-ordering takes place in a window of size $w = 6$, such that only local block re-ordering is handled.

Table 4: Effect of the skip parameter for the two search strategies; $\tau_c = 2.5$, $\tau_v = 1.0$, and window width $w = 6$.

    Skip   S-Beam BLEU   CPU [secs]   M-Beam BLEU   CPU [secs]
    0      40.7 ± 1.4    108          40.9 ± 1.5    116
    1      44.1 ± 1.5    729          44.1 ± 1.6    2459
    2      44.3 ± 1.6    4408         44.4 ± 1.6    8437
    3      44.3 ± 1.6    7467         44.5 ± 1.6    10048
The following block bigram scoring is used: a block pair $(b; b')$ with corresponding source phrase matches $([j, j'], [k, k'])$ is represented as a feature vector $f(b; b') \in \mathbb{R}^N$. The feature-vector components are the negative logarithm of some probabilities as well as a word-penalty feature. The real-valued features include the following: a block translation score derived from phrase occurrence statistics (1), a trigram language model to predict target words (2-3), a lexical weighting score for the block-internal words (4), a distortion model (5-6), as well as the negative target phrase length (7). The transition cost is computed as $c(b, b') = w^T \cdot f(b; b')$, where $w \in \mathbb{R}^N$ is a weight vector that sums to $1.0$: $\sum_{i=1}^{N} w_i = 1.0$. The weights are trained using a procedure similar to (Och, 2003) on held-out test data. A block set of 9.5 million blocks, which is not filtered according to any particular test set, is used; it has been generated by a phrase-pair selection algorithm similar to (Al-Onaizan et al., 2004). The training data is sentence-aligned and consists of 3.3 million training sentence pairs.
Beam-search results are presented in terms of two pruning thresholds: the coverage pruning threshold $\tau_v$ and the cardinality pruning threshold $\tau_c$ (Tillmann and Ney, 2003). To carry out the pruning, the minimum cost with respect to each coverage set $C$ and each cardinality $c$ is computed for a state set $S$. For the coverage pruning, states are distinguished according to the subset $C$ of covered positions. The minimum cost $\hat{Q}(C)$ is defined as $\hat{Q}(C) = \min \{\, d \mid [C, h; d] \in S \,\}$. For the cardinality pruning, states are distinguished according to the cardinality $c(C)$ of the subsets $C$ of covered positions. The minimum cost $\hat{Q}(c)$ is defined for all hypotheses with the same cardinality $c(C) = c$: $\hat{Q}(c) = \min_{C :\, c(C) = c} \hat{Q}(C)$. States $s$ in $S$ are pruned if their shortest path cost $d(s)$ is greater than the minimum cost plus the corresponding pruning threshold:

$d(s) > \tau_v + \hat{Q}(C)$,
$d(s) > \tau_c + \hat{Q}(c)$.
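Both pruning rules can be sketched in a few lines (illustrative Python; the state layout follows the earlier sketches):

    # Coverage and cardinality pruning with thresholds tau_v and tau_c.
    # 'states' maps (C, h) -> path cost d, with C a frozenset of covered
    # positions; states worse than the minimum plus threshold are dropped.
    def prune(states, tau_v, tau_c):
        inf = float("inf")
        best_cov, best_card = {}, {}
        for (C, h), d in states.items():
            best_cov[C] = min(d, best_cov.get(C, inf))
            best_card[len(C)] = min(d, best_card.get(len(C), inf))
        return {(C, h): d for (C, h), d in states.items()
                if d <= tau_v + best_cov[C]          # coverage pruning
                and d <= tau_c + best_card[len(C)]}  # cardinality pruning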
The same state set pruning is used for the S-Beam and the M-Beam search algorithms.

Table 5: Effect of the coverage pruning threshold $\tau_v$ on BLEU and the overall CPU time [secs]. To restrict the overall search space, the cardinality pruning threshold is set to $\tau_c = 10.0$ and the cardinality histogram pruning is set to 2500.

    $\tau_v$   S-Beam BLEU   CPU [secs]   M-Beam BLEU   CPU [secs]
    0.001      37.5 ± 1.4    106          40.5 ± 1.5    198
    0.01       38.3 ± 1.4    109          41.0 ± 1.5    213
    0.05       40.7 ± 1.5    139          43.2 ± 1.6    301
    0.1        42.6 ± 1.5    215          44.2 ± 1.6    508
    0.25       44.1 ± 1.6    1018         44.4 ± 1.6    1977
    0.5        44.3 ± 1.6    4527         44.4 ± 1.6    6289
    1.0        44.3 ± 1.6    6623         44.5 ± 1.6    8092
    2.5        44.3 ± 1.6    6797         44.5 ± 1.6    8187
    5.0        44.3 ± 1.6    6810         44.5 ± 1.6    8191
Table 4 shows the effect of the skip size on the translation performance. The pruning thresholds are set to conservatively large values: $\tau_c = 2.5$ and $\tau_v = 1.0$. Only if no block re-ordering is allowed (skip $= 0$) does performance drop significantly. The S-Beam search is consistently faster than the M-Beam search algorithm. Table 5 demonstrates the effect of the coverage pruning threshold. Here, a conservatively large cardinality pruning threshold of $\tau_c = 10.0$ and so-called histogram pruning, which restricts the overall number of states in the beam to a maximum of 2500, are used to restrict the overall search space. The S-Beam search algorithm is consistently faster than the M-Beam search algorithm for the same pruning threshold, but performance in terms of BLEU score drops significantly for lower coverage pruning thresholds $\tau_v < 0.5$, as a smaller portion of the overall search space is searched, which leads to search errors. For larger pruning thresholds $\tau_v \geq 0.5$, where the performance of the two algorithms in terms of BLEU score is nearly identical, the S-Beam algorithm runs significantly faster. For a coverage threshold of $\tau_v = 0.1$, the S-Beam algorithm is as fast as the M-Beam algorithm at $\tau_v = 0.01$, but it obtains a significantly higher BLEU score of 42.6 versus 41.0 for the M-Beam algorithm. The results in this section show that the S-Beam algorithm generally runs faster, since the beam-search pruning is applied to all states simultaneously, making more efficient use of the beam-search concept.
6 Discussion
The decoding algorithm shown here is most similar to the decoding algorithms presented in (Koehn, 2004) and (Och and Ney, 2004), the latter being used for the Alignment Template Model for SMT. These algorithms also include an estimate of the path completion cost, which can easily be included in this work as well (Tillmann, 2001). (Knight, 1999) shows that the decoding problem for SMT as well as some bilingual tiling problems are NP-complete, so no efficient algorithm exists in the general case. But using DP-based optimization techniques and appropriate restrictions leads to efficient DP-based decoding algorithms, as shown in this paper.

The efficient block alignment algorithm in Section 4 is related to the inversion transduction grammar approach to bilingual parsing described in (Wu, 1997): in both cases the number of alignments is drastically reduced by introducing appropriate re-ordering restrictions. The list-based decoding algorithms can also be compared to an Earley-style parsing algorithm that processes lists of parse states in a single left-to-right run over the input sentence. For this algorithm, the comparison in terms of a shortest-path algorithm is less obvious: in the so-called completion step, the parser re-visits states in previous stacks. But it is interesting to note that there is no multiple-list variant of that parser. In phrase-based decoding, a multiple-list decoder is feasible only because exact phrase matches occur. A block decoding algorithm that would allow for a 'fuzzy' match of source phrases, e.g. where insertions or deletions of some source phrase words are allowed, would need to carry out its computations using two stacks, since the match end of a block is unknown.
7 Acknowledgment
This work was partially supported by DARPA and monitored by SPAWAR under contract No. N66001-99-2-8916. The author would like to thank the anonymous reviewers for their detailed criticism of this paper.

References
Yaser Al-Onaizan, Niyu Ge, Young-Suk Lee, Kishore Papineni, Fei Xia, and Christoph Tillmann. 2004. IBM Site Report. In NIST 2004 MT Workshop, Alexandria, VA, June.

Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Andrew S. Kehler, and Robert L. Mercer. 1996. Language Translation Apparatus and Method of Using Context-Based Translation Models. United States Patent 5510981, April.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001. Introduction to Algorithms. MIT Press, Cambridge, MA.

Stuart E. Dreyfus and Averill M. Law. 1977. The Art and Theory of Dynamic Programming (Mathematics in Science and Engineering, vol. 130). Academic Press, New York, NY.

Michael Held and Richard M. Karp. 1962. A Dynamic Programming Approach to Sequencing Problems. Journal of the Society for Industrial and Applied Mathematics, 10(1):196–210.

Fred Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA.

Kevin Knight. 1999. Decoding Complexity in Word-Replacement Translation Models. Computational Linguistics, 25(4):607–615.

Philipp Koehn. 2004. Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. In Proceedings of AMTA 2004, Washington, DC, September-October.

Bruce Lowerre and Raj Reddy. 1980. The Harpy Speech Understanding System. In Trends in Speech Recognition, W. A. Lea, ed. Prentice Hall, Englewood Cliffs, NJ.

H. Ney, D. Mergel, A. Noll, and A. Paeseler. 1992. Data Driven Search Organization for Continuous Speech Recognition in the SPICOS System. IEEE Transactions on Signal Processing, 40(2):272–281.

S. Niessen, S. Vogel, H. Ney, and C. Tillmann. 1998. A DP-Based Search Algorithm for Statistical Machine Translation. In Proceedings of ACL/COLING 98, pages 960–967, Montreal, Canada, August.

Franz Josef Och and Hermann Ney. 2004. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, 30(4):417–450.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL'03, pages 160–167, Sapporo, Japan.

Christoph Tillmann and Hermann Ney. 2003. Word Reordering and a DP Beam Search Algorithm for Statistical Machine Translation. Computational Linguistics, 29(1):97–133.

Christoph Tillmann, Stefan Vogel, Hermann Ney, and Alex Zubiaga. 1997. A DP-based Search Using Monotone Alignments in Statistical Translation. In Proceedings of ACL 97, pages 289–296, Madrid, Spain, July.

Christoph Tillmann. 2001. Word Re-Ordering and Dynamic Programming based Search Algorithm for Statistical Machine Translation. Ph.D. thesis, University of Technology, Aachen, Germany.

Christoph Tillmann. 2003. A Projection Extension Algorithm for Statistical Machine Translation. In Proceedings of EMNLP 03, pages 1–8, Sapporo, Japan, July.

Dekai Wu. 1997. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23(3):377–403.
