A Decoder for Syntax-based Statistical MT
Kenji Yamada and Kevin Knight
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
{kyamada,knight}@isi.edu
Abstract
This paper describes a decoding algorithm
for a syntax-based translation model (Ya-
mada and Knight, 2001). The model
has been extended to incorporate phrasal
translations as presented here. In con-
trast to a conventional word-to-word sta-
tistical model, a decoder for the syntax-
based model builds up an English parse
tree given a sentence in a foreign lan-
guage. As the model size becomes huge in
a practical setting, and the decoder consid-
ers multiple syntactic structures for each
word alignment, several pruning tech-
niques are necessary. We tested our de-
coder in a Chinese-to-English translation
system, and obtained better results than
IBM Model 4. We also discuss issues con-
cerning the relation between this decoder
and a language model.
1 Introduction
A statistical machine translation system based on the
noisy channel model consists of three components:
a language model (LM), a translation model (TM),
and a decoder. For a system which translates from
a foreign language f to English e, the LM gives
a prior probability P(e) and the TM gives a chan-
nel translation probability P(f|e). These models
are automatically trained using monolingual (for the
LM) and bilingual (for the TM) corpora. A decoder
then finds the best English sentence given a foreign
sentence, i.e., the one that maximizes P(e|f), which also
maximizes P(f|e)P(e) according to Bayes' rule.
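Written out as a worked equation (standard noisy-channel algebra, using the notation above, with ê denoting the decoder output):

    \hat{e} = \arg\max_{e} P(e \mid f)
            = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
            = \arg\max_{e} P(f \mid e)\, P(e)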
A different decoder is needed for different choices
of LM and TM. Since P(e) and P(f|e) are not sim-
ple probability tables but are parameterized models,
a decoder must conduct a search over the space de-
fined by the models. For the IBM models defined
by a pioneering paper (Brown et al., 1993), a de-
coding algorithm based on a left-to-right search was
described in (Berger et al., 1996). Recently (Ya-
mada and Knight, 2001) introduced a syntax-based
TM which utilized syntactic structure in the chan-
nel input, and showed that it could outperform the
IBM model in alignment quality. In contrast to the
IBM models, which are word-to-word models, the
syntax-based model works on a syntactic parse tree,
so the decoder builds up an English parse tree ε
given a sentence f in a foreign language. This pa-
per describes an algorithm for such a decoder, and
reports experimental results.
Other statistical machine translation systems such
as (Wu, 1997) and (Alshawi et al., 2000) also pro-
duce a tree ε given a sentence f. Their models are
based on mechanisms that generate two languages
at the same time, so an English tree ε is obtained
as a subproduct of parsing f. However, their use of
the LM is not mathematically motivated, since their
models do not decompose into P(f|e) and P(e),
unlike the noisy channel model.
Section 2 briefly reviews the syntax-based TM,
and Section 3 describes phrasal translation as an ex-
tension. Section 4 presents the basic idea for de-
coding. As in other statistical machine translation
systems, the decoder has to cope with a huge search
space. Section 5 describes how to prune the search
space for practical decoding. Section 6 shows exper-
imental results. Section 7 discusses LM issues, and
is followed by conclusions.
2 Syntax-based TM
The syntax-based TM defined by (Yamada and
Knight, 2001) assumes an English parse tree ε as
a channel input. The channel applies three kinds of
stochastic operations on each node ε_i: reordering
child nodes (ρ), inserting an optional extra word
to the left or right of the node (ν), and translating
leaf words (τ).1 These operations are independent
of each other and are conditioned on the features
(R, N, T) of the node. Figure 1 shows an example.
The child node sequence of the top node VB is re-
ordered from PRP-VB1-VB2 into PRP-VB2-VB1
as seen in the second tree (Reordered). An extra
word ha is inserted at the leftmost node PRP as seen
in the third tree (Inserted). The English word He un-
der the same node is translated into a foreign word
kare as seen in the fourth tree (Translated). After
these operations, the channel emits a foreign word
sentence f by taking the leaves of the modified tree.
Formally, the channel probability P(f|ε) is
    P(f \mid \varepsilon) = \sum_{\theta:\, Str(\theta(\varepsilon)) = f}\; \prod_{i=1}^{n} P(\theta_i \mid \varepsilon_i)

    P(\theta_i \mid \varepsilon_i) =
      \begin{cases}
        t(\tau_i \mid T_i)\, n(\nu_i \mid N_i) & \text{if } \varepsilon_i \text{ is terminal} \\
        r(\rho_i \mid R_i)\, n(\nu_i \mid N_i) & \text{otherwise}
      \end{cases}

where θ = θ_1, θ_2, ..., θ_n = ⟨ρ_1, ν_1, τ_1⟩, ⟨ρ_2, ν_2, τ_2⟩, ..., ⟨ρ_n, ν_n, τ_n⟩, and
Str(θ(ε)) is the sequence of leaf words of the tree obtained by transforming ε by θ.
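As a minimal illustration of this factorization, the sketch below multiplies the per-node operation probabilities for one fixed choice of θ on a toy two-node tree. The table contents, the feature encoding, and the node dictionaries are invented stand-ins for the example, not the trained model files.

    # Sketch: channel probability for one fixed set of node operations theta,
    # following the case formula above (r*n at non-terminals, t*n at leaves).
    R_TABLE = {("PRP-VB1-VB2", "PRP-VB2-VB1"): 0.2}      # r(rho | R)
    N_TABLE = {(("VB", "PRP"), ("right", "ha")): 0.1,     # n(nu | N)
               (("VB", "TOP"), "none"): 0.7}
    T_TABLE = {("he", "kare"): 0.5}                        # t(tau | T)

    def node_prob(node):
        """P(theta_i | eps_i): r*n at non-terminals, t*n at leaves."""
        n_p = N_TABLE.get((node["N"], node["nu"]), 1e-6)
        if node["is_leaf"]:
            return T_TABLE.get((node["T"], node["tau"]), 1e-6) * n_p
        return R_TABLE.get((node["R"], node["rho"]), 1e-6) * n_p

    def channel_prob(nodes):
        p = 1.0
        for node in nodes:
            p *= node_prob(node)
        return p

    toy_nodes = [
        {"is_leaf": False, "R": "PRP-VB1-VB2", "rho": "PRP-VB2-VB1",
         "N": ("VB", "TOP"), "nu": "none"},
        {"is_leaf": True, "T": "he", "tau": "kare",
         "N": ("VB", "PRP"), "nu": ("right", "ha")},
    ]
    print(channel_prob(toy_nodes))   # 0.2 * 0.7 * 0.5 * 0.1 = 0.007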
The model tables r(ρ|R), n(ν|N), and t(τ|T) are
called the r-table, n-table, and t-table, respectively.
These tables contain the probabilities of the channel
operations (ρ, ν, τ) conditioned on the features (R,
N, T). In Figure 1, the r-table specifies the prob-
ability of having the second tree (Reordered) given
the first tree. The n-table specifies the probability
of having the third tree (Inserted) given the second
1The channel operations are designed to model the differ-
ence in the word order (SVO for English vs. VSO for Arabic)
and case-marking schemes (word positions in English vs. case-
marker particles in Japanese).
tree. The t-table specifies the probability of having
the fourth tree (Translated) given the third tree.
The probabilities in the model tables are automat-
ically obtained by an EM algorithm using pairs of ε
(channel input) and f (channel output) as a training
corpus. Usually a bilingual corpus comes as pairs of
translation sentences, so we need to parse the cor-
pus. As we need to parse sentences on the channel
input side only, many X-to-English translation sys-
tems can be developed with an English parser alone.
The conditioning features (R, N, T) can be any-
thing that is available on a tree ε; however, they
should be carefully selected so as not to cause data-
sparseness problems. Also, the choice of features
may affect the decoding algorithm. In our experi-
ment, the sequence of child node labels was used
for R, the pair of the node label and the parent label
was used for N, and the identity of the English word
was used for T. For example,
r(ρ|R) = P(PRP-VB2-VB1 | PRP-VB1-VB2)
for the top node in Figure 1. Similarly, for the
node PRP, n(ν|N) = P(right, ha | VB-PRP) and
t(τ|T) = P(kare | he). More detailed examples are
found in (Yamada and Knight, 2001).
3 Phrasal Translation
In (Yamada and Knight, 2001), the translation τ is a
1-to-1 lexical translation from an English word e to a
foreign word f, i.e., t(τ|T) = t(f|e). To allow non-
1-to-1 translation, such as for idiomatic phrases or
compound nouns, we extend the model as follows.
First we use a fertility φ, as used in the IBM models,
to allow 1-to-N mapping:
    t(\tau \mid T) = t(f_1 f_2 \cdots f_{\phi} \mid e) = \phi(\phi \mid e)\, \prod_{i=1}^{\phi} t(f_i \mid e)
For N-to-N mapping, we allow a direct transla-
tion τ of an English phrase e_1 e_2 ⋯ e_n to a foreign
phrase f_1 f_2 ⋯ f_m at non-terminal tree nodes as
    t_{ph}(\tau \mid T) = t(f_1 f_2 \cdots f_m \mid e_1 e_2 \cdots e_n)
                        = \phi(m \mid e_1 e_2 \cdots e_n)\, \prod_{i=1}^{m} t(f_i \mid e_1 e_2 \cdots e_n)
and linearly mix this phrasal translation with the
word-to-word translation, i.e.,
    P(\theta_i \mid \varepsilon_i) = \lambda_{ph}\, t_{ph}(\tau_i \mid T_i) + (1 - \lambda_{ph})\, r(\rho_i \mid R_i)\, n(\nu_i \mid N_i)

if ε_i is non-terminal.
[Figure 1: Channel Operations: Reorder, Insert, and Translate. An English parse tree (1. Channel Input) has its child nodes reordered (2. Reordered), the extra word ha inserted (3. Inserted), and its leaf words translated (4. Translated); the leaves are then emitted as the channel output sentence kare ha ongaku wo kiku no ga daisuki desu (5. Channel Output).]
In practice, the phrase lengths
(n, m) are limited to reduce the model size. In our ex-
periment (Section 5), we restricted n to lie within a
fixed linear band around m, to avoid pairs of extremely
different lengths. The band was obtained by randomly
sampling the lengths of translation pairs. See (Ya-
mada, 2002) for the exact bounds and other details.
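As a rough sketch of how the mixture at a non-terminal node could be evaluated, the snippet below combines the phrasal score with the word-level reorder/insert score. The mixing weight, the table layouts (phrases as tuples), and the smoothing constant are placeholders, not trained parameters.

    # Sketch of the linear mixture at a non-terminal node:
    # P(theta_i|eps_i) = lam * t_ph(tau_i|T_i) + (1 - lam) * r(rho_i|R_i) * n(nu_i|N_i)
    LAMBDA_PH = 0.3          # placeholder mixing weight

    def phrasal_prob(f_phrase, e_phrase, fert_table, t_table):
        """t_ph = phi(m | e-phrase) * prod_i t(f_i | e-phrase); phrases are tuples."""
        p = fert_table.get((len(f_phrase), e_phrase), 1e-6)
        for f in f_phrase:
            p *= t_table.get((f, e_phrase), 1e-6)
        return p

    def mixed_prob(t_ph, r_rho, n_nu, lam=LAMBDA_PH):
        return lam * t_ph + (1.0 - lam) * r_rho * n_nu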
4 Decoding
Our statistical MT system is based on the noisy-
channel model, so the decoder works in the reverse
direction of the channel. Given a supposed chan-
nel output (e.g., a French or Chinese sentence), it
will find the most plausible channel input (an En-
glish parse tree) based on the model parameters and
the prior probability of the input.
In the syntax-based model, the decoder’s task is
to find the most plausible English parse tree given an
observed foreign sentence. Since the task is to build
a tree structure from a string of words, we can use a
mechanism similar to normal parsing, which builds
an English parse tree from a string of English words.
Here we need to build an English parse tree from a
string of foreign (e.g., French or Chinese) words.
To parse in such an exotic way, we start from
an English context-free grammar obtained from the
training corpus,2 and extend the grammar to incor-
porate the channel operations in the translation
model. For each non-lexical rule in the original En-
glish grammar (such as "VP → VB NP PP"), we
supplement it with reordered rules (e.g. "VP →
NP PP VB", "VP → NP VB PP", etc.) and asso-
ciate them with the original English order and the
reordering probability from the r-table. Similarly,
rules such as "VP → VP X" and "X → word" are
added for extra word insertion, and they are associ-
ated with a probability from the n-table. For each
lexical rule in the English grammar, we add rules
such as "englishWord → foreignWord" with a prob-
ability from the t-table.
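The sketch below shows one way this expansion could be built. The r-table layout (keyed by the original and the reordered right-hand sides) and the rule dictionaries are assumptions made for the illustration; insertion rules from the n-table would be added analogously.

    from itertools import permutations

    def expand_rule(lhs, rhs, r_table):
        """Supplement 'lhs -> rhs' with reordered variants weighted by the r-table.

        Each decoding rule remembers the original English order so the decoded
        tree can later be back-reordered."""
        rules = []
        for perm in set(permutations(rhs)):
            p = r_table.get((tuple(rhs), perm), 0.0)
            if p > 0.0:
                rules.append({"lhs": lhs, "rhs": list(perm),
                              "english_order": list(rhs), "prob": p})
        return rules

    def lexical_rules(t_table):
        """Add 'englishWord -> foreignWord' rules from the t-table."""
        return [{"lhs": e, "rhs": [f], "prob": p} for (f, e), p in t_table.items()]

    # Toy example: keep only reorderings with non-zero r-table probability.
    r_table = {(("VB", "NP", "PP"), ("NP", "PP", "VB")): 0.3,
               (("VB", "NP", "PP"), ("VB", "NP", "PP")): 0.5}
    print(expand_rule("VP", ["VB", "NP", "PP"], r_table))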
Now we can parse a string of foreign words and
build up a tree, which we call a decoded tree. An
example is shown in Figure 2. The decoded tree is
built up in the foreign language word order. To ob-
tain a tree in the English order, we apply the reverse
of the reorder operation (back-reordering) using the
information associated to the rule expanded by the
r-table. In Figure 2, the numbers in the dashed oval
near the top node show the original English order.
Then, we obtain an English parse tree by remov-
ing the leaf nodes (foreign words) from the back-
reordered tree. Among the possible decoded trees,
we pick the best tree in which the product of the LM
probability (the prior probability of the English tree)
and the TM probability (the probabilities associated
2The training corpus for the syntax-based model consists of pairs of English parse trees and foreign sentences.
Figure 2: Decoded Tree
with the rules in the decoded tree) is the highest.
The use of an LM needs consideration. Theoret-
ically we need an LM which gives the prior prob-
ability of an English parse tree. However, we can
approximate it with an n-gram LM, which is well-
studied and widely implemented. We will discuss
this point later in Section 7.
If we use a trigram model for the LM, a con-
venient implementation is to first build a decoded-
tree forest and then to pick out the best tree using a
trigram-based forest-ranking algorithm as described
in (Langkilde, 2000). The ranker uses two leftmost
and rightmost leaf words to efficiently calculate the
trigram probability of a subtree, and finds the most
plausible tree according to the trigram and the rule
probabilities. This algorithm finds the optimal tree
in terms of the model probability — but it is not
practical when the vocabulary size and the rule size
grow. The next section describes how to make it
practical.
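The idea of exposing only a subtree's two leftmost and two rightmost leaf words can be sketched as follows; when two adjacent subtrees are joined, only the trigrams that straddle the boundary still need to be scored. The data structures and the trigram_logprob callback are assumptions for the illustration, not the forest ranker of (Langkilde, 2000).

    # Sketch: combine two adjacent subtrees, scoring only boundary trigrams.
    def join(left, right, trigram_logprob):
        """left/right: dicts with 'score', 'lmost', 'rmost' (up to 2 words each)."""
        boundary = left["rmost"] + right["lmost"]
        score = left["score"] + right["score"]
        # score trigrams whose last word lies in the right subtree's first words
        for i in range(len(left["rmost"]), len(boundary)):
            ctx = tuple(boundary[max(0, i - 2):i])
            score += trigram_logprob(ctx, boundary[i])
        return {"score": score,
                "lmost": (left["lmost"] + right["lmost"])[:2],
                "rmost": (left["rmost"] + right["rmost"])[-2:]}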
5 Pruning
We use our decoder for Chinese-English translation
in a general news domain. The TM becomes very
huge for such a domain. In our experiment (see Sec-
tion 6 for details), there are about 4M non-zero en-
tries in the trained t(f|e) table. About 10K CFG
rules are used in the parsed corpus of English, which
results in about 120K non-lexical rules for the de-
coding grammar (after we expand the CFG rules as
described in Section 4). We applied the simple al-
gorithm from Section 4, but this experiment failed
— no complete translations were produced. Even
four-word sentences could not be decoded. This is
not only because the model size is huge, but also be-
cause the decoder considers multiple syntactic struc-
tures for the same word alignment, i.e., there are
several different decoded trees even when the trans-
lation of the sentence is the same. We then applied
the following measures to achieve practical decod-
ing. The basic idea is to use additional statistics from
the training corpus.
beam search: We give up optimal decoding
by using a standard dynamic-programming parser
with beam search, which is similar to the parser
used in (Collins, 1999). A standard dynamic-
programming parser builds up ⟨nonterminal, input-
substring⟩ tuples bottom-up according to the
grammar rules. When the parsing cost3 comes only
from the features within a subtree (TM cost, in our
case), the parser will find the optimal tree by keep-
ing the single best subtree for each tuple. When the
cost depends on the features outside of a subtree,
we need to keep all the subtrees for possible differ-
ent outside features (boundary words for the trigram
LM cost) to obtain the optimal tree. Instead of keep-
ing all the subtrees, we only retain subtrees within a
beam width for each input-substring. Since the out-
side features are not considered for the beam prun-
ing, the optimality of the parse is not guaranteed, but
the required memory size is reduced.
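A minimal sketch of such a beam-pruned chart cell is given below; the beam width, the entry layout, and the scoring are placeholders, and the real decoder also carries boundary words for the trigram LM with each entry.

    import heapq

    BEAM = 10   # placeholder beam width per (nonterminal, input-substring) cell

    def add_to_cell(cell, entry):
        """cell: list used as a min-heap of (score, tie, entry); keep the BEAM best."""
        heapq.heappush(cell, (entry["score"], id(entry), entry))
        if len(cell) > BEAM:
            heapq.heappop(cell)     # drop the lowest-scoring subtree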
t-table pruning: Given a foreign (Chinese) sen-
tence to the decoder, we only consider English
words e for each foreign word f such that P(e|f) is
high. In addition, only a limited set of part-of-speech
labels pos is considered to reduce the number of
possible decoded-tree structures. Thus we only use
the top-5 (e, pos) pairs ranked by

    P(e, pos \mid f) = P(pos)\, P(e \mid pos)\, P(f \mid e, pos)\, /\, P(f)
                     \approx P(pos)\, P(e \mid pos)\, P(f \mid e).

Notice that P(f|e) is a model parameter, and that
P(pos) and P(e|pos) are obtained from the parsed
training corpus.
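A small sketch of this selection is shown below; the dictionary layouts (in particular keying t(f|e) by (e, pos), although the TM parameter itself does not depend on pos) are assumptions made for the illustration.

    # Sketch of t-table pruning: for a foreign word f, keep the top-5 (e, pos)
    # pairs ranked by P(pos) * P(e|pos) * P(f|e).
    def top_translations(f, p_pos, p_e_given_pos, p_f_given_e, k=5):
        scored = []
        for (e, pos), p_fe in p_f_given_e.get(f, {}).items():
            score = p_pos.get(pos, 0.0) * p_e_given_pos.get((e, pos), 0.0) * p_fe
            scored.append((score, e, pos))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [(e, pos) for _, e, pos in scored[:k]]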
phrase pruning: We only consider limited pairs
(e_1 e_2 ⋯ e_n, f_1 f_2 ⋯ f_m) for phrasal translation (see
3rule-cost = −log(rule-probability)
Section 3). The pair must appear more than once in
the Viterbi alignments4 of the training corpus. Then
we use the top-10 pairs ranked similarly to t-table
pruning above, except that we replace P(pos) P(e|pos) with
P(e) and use trigrams to estimate P(e). By this prun-
ing, we effectively remove junk phrase pairs, most of
which come from misaligned sentences or untrans-
lated phrases in the training corpus.
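The filter could look roughly like the sketch below; the count and table layouts, the trigram_prob_e callback, and ranking the pairs globally (rather than per foreign phrase) are simplifying assumptions.

    # Sketch of phrase pruning: keep phrase pairs seen more than once in the
    # Viterbi alignments, ranked by P(e-phrase) * t(f-phrase | e-phrase),
    # where P(e-phrase) is estimated with trigrams.
    def top_phrase_pairs(pair_counts, trigram_prob_e, t_phrase, k=10):
        kept = []
        for (e_phrase, f_phrase), count in pair_counts.items():
            if count < 2:                    # must appear more than once
                continue
            score = trigram_prob_e(e_phrase) * t_phrase.get((f_phrase, e_phrase), 0.0)
            kept.append((score, e_phrase, f_phrase))
        kept.sort(key=lambda x: x[0], reverse=True)
        return kept[:k]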
r-table pruning: To reduce the number of
rules for the decoding grammar, we use the
top-N rules ranked by P(rule) P(reord) so that
\sum_{i=1}^{N} P(rule_i)\, P(reord_i) > 0.95, where P(rule) is
a prior probability of the rule (in the original En-
glish order) found in the parsed English corpus, and
Pa4 reorda6 is the reordering probability in the TM. The
product is a rough estimate of how likely a rule is
used in decoding. Because only a limited number
of reorderings are used in actual translation, a small
number of rules are highly probable. In fact, among
a total of 138,662 reorder-expanded rules, the most
likely 875 rules contribute 95% of the probability
mass, so discarding the rules which contribute the
lower 5% of the probability mass efficiently elimi-
nates more than 99% of the total rules.
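A minimal sketch of this cumulative cutoff, assuming the rules arrive as (rule, P(rule)*P(reord)) pairs and normalizing against their total mass:

    # Sketch of r-table pruning: keep the most probable reorder-expanded rules
    # until they cover 95% of the P(rule)*P(reord) mass.
    def prune_rules(rules, mass=0.95):
        ranked = sorted(rules, key=lambda x: x[1], reverse=True)
        total = sum(p for _, p in ranked)
        kept, acc = [], 0.0
        for rule, p in ranked:
            if acc >= mass * total:
                break
            kept.append(rule)
            acc += p
        return kept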
zero-fertility words: An English word may be
translated into a null (zero-length) foreign word.
This happens when the fertility φ(0|e) > 0, and such
an English word e (called a zero-fertility word) must be
inserted during decoding. The decoding parser
is modified to allow inserting zero-fertility words,
but unlimited insertion easily blows up the memory
space. Therefore only limited insertion is allowed.
Observing the Viterbi alignments of the training cor-
pus, the top-20 frequent zero-fertility words5 cover
over 70% of the cases, thus only those are allowed
to be inserted. Also we use syntactic context to limit
the insertion. For example, a zero-fertility word in
is inserted as IN when the "PP → IN NP-A" rule is
applied. Again, observing the Viterbi alignments,
the top-20 frequent contexts cover over 60% of the
cases, so we allow insertions only in these contexts.
This kind of context sensitive insertion is possible
because the decoder builds a syntactic tree. Such se-
lective insertion by syntactic context is not easy for
4Viterbi alignment is the most probable word alignment ac-
cording to the trained TM tables.
5They are the, to, of, a, in, is, be, that, on, and, are, for, will,
with, have, it, ’s, has, i, and by.
a word-for-word based IBM model decoder.

system     P1/P2/P3/P4          LP      BLEU
ibm4       36.6/11.7/4.6/1.6    0.959   0.072
syn        39.8/15.8/8.3/4.9    0.781   0.099
syn-nozf   40.6/15.3/8.1/5.3    0.797   0.102

Table 1: Decoding performance
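Returning to the zero-fertility constraints described before Table 1, a minimal sketch of the insertion check is given below; the encoding of a context as a (rule, word) pair and the allowed_contexts set are assumptions about how the constraint could be represented.

    # Sketch: only the 20 most frequent zero-fertility words may be inserted,
    # and only under one of the frequent syntactic contexts observed in the
    # Viterbi alignments.
    ZF_WORDS = {"the", "to", "of", "a", "in", "is", "be", "that", "on", "and",
                "are", "for", "will", "with", "have", "it", "'s", "has", "i", "by"}

    def may_insert(word, rule, allowed_contexts):
        return word in ZF_WORDS and (rule, word) in allowed_contexts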
The pruning techniques shown above use extra
statistics from the training corpus, such as P(pos),
P(e|pos), and P(rule). These statistics may be consid-
ered as a part of the LM P(e), and such syntactic
probabilities are essential when we mainly use tri-
grams for the LM. In this respect, the pruning is use-
ful not only for reducing the search space, but also
for improving the quality of translation. We also use
statistics from the Viterbi alignments, such as the
phrase translation frequency and the zero-fertility
context frequency. These are statistics which are not
modeled in the TM. The frequency count is essen-
tially a joint probability P(f, e), while the TM uses
a conditional probability P(f|e). Utilizing statistics
outside of a model is an important idea for statis-
tical machine translation in general. For example,
a decoder in (Och and Ney, 2000) uses alignment
template statistics found in the Viterbi alignments.
6 Experimental Results: Chinese/English
This section describes results from our experiment
using the decoder as described in the previous sec-
tion. We used a Chinese-English translation corpus
for the experiment. After discarding long sentences
(more than 20 words in English), the English side of
the corpus consisted of about 3M words, and it was
parsed with Collins’ parser (Collins, 1999). Train-
ing the TM took about 8 hours using a 54-node unix
cluster. We selected 347 short sentences (less than
14 words in the reference English translation) from
the held-out portion of the corpus, and they were
used for evaluation.
Table 1 shows the decoding performance for the
test sentences. The first system ibm4 is a reference
system, which is based on IBM Model4. The second
and the third (syn and syn-nozf) are our decoders.
Both used the same decoding algorithm and prun-
ing as described in the previous sections, except that
syn-nozf allowed no zero-fertility insertions. The
average decoding speed was about 100 seconds6 per
sentence for both syn and syn-nozf.
As an overall decoding performance measure, we
used the BLEU metric (Papineni et al., 2002). This
measure is a geometric average of n-gram accu-
racy, adjusted by a length penalty factor LP.7 The
n-gram accuracy (in percentage) is shown in Table 1
as P1/P2/P3/P4 for unigram/bigram/trigram/4-gram.
Overall, our decoder performed better than the IBM
system, as indicated by the higher BLEU score. We
obtained better n-gram accuracy, but the lower LP
score penalized the overall score. Interestingly, the
system with no explicit zero-fertility word insertion
(syn-nozf) performed better than the one with zero-
fertility insertion (syn). It seems that most zero-
fertility words were already included in the phrasal
translations, and the explicit zero-fertility word in-
sertion produced more garbage than expected words.
system   Coverage        system   Coverage
r95      92/92           w5       92/92
r98      47/92           w10      89/92
r100     20/92           w20      69/92

Table 2: Effect of pruning
To verify that the pruning was effective, we re-
laxed the pruning threshold and checked the decod-
ing coverage for the first 92 sentences of the test
data. Table 2 shows the result. On the left, the
r-table pruning was relaxed from the 95% level to
98% or 100%. On the right, the t-table pruning was
relaxed from the top-5 (e, pos) pairs to the top-10 or
top-20 pairs. The systems r95 and w5 are identical
to syn-nozf in Table 1.
When r-table pruning was relaxed from 95% to
98%, only about half (47/92) of the test sentences
were decoded; the others were aborted due to lack of
memory. When it was further relaxed to 100% (i.e.,
no pruning was done), only 20 sentences were de-
coded. Similarly, when the t-table pruning threshold
was relaxed, fewer sentences could be decoded due
to the memory limitations.
Although our decoder performed better than the
6Using a single-CPU 800MHz Pentium III unix system with
1GB memory.
7BLEU = exp(\sum_{n=1}^{N} w_n \log p_n) × LP, where LP = exp(1 − r/c) if c ≤ r and LP = 1 if c > r, with w_n = 1/N, N = 4, c the system output length, and r the reference length.
IBM system in the BLEU score, the obtained gain
was less than what we expected. We see the follow-
ing three possible reasons. First, the syntax of Chi-
nese is not extremely different from English, com-
pared with other languages such as Japanese or Ara-
bic. Therefore, the TM could not take advantage
of syntactic reordering operations. Second, our de-
coder looks for a decoded tree, not just for a de-
coded sentence. Thus, the search space is larger than
IBM models, which might lead to more search errors
caused by pruning. Third, the LM used for our sys-
tem was exactly the same as the LM used by the IBM
system. Decoding performance might be heavily in-
fluenced by LM performance. In addition, since the
TM assumes an English parse tree as input, a trigram
LM might not be appropriate. We will discuss this
point in the next section.
Phrasal translation worked pretty well. Figure 3
shows the top-20 frequent phrase translations ob-
served in the Viterbi alignment. The leftmost col-
umn shows how many times they appeared. Most of
them are correct. It even detected frequent sentence-
to-sentence translations, since we only imposed a
relative length limit for phrasal translations (Sec-
tion 3). However, some of them, such as the one with
(in cantonese), are wrong. We expected that these
junk phrases could be eliminated by phrase pruning
(Section 5); however, junk phrases that appear many
times in the corpus were not effectively filtered out.
7 Decoded Trees
The BLEU score measures the quality of the decoder
output sentences. We were also interested in the syn-
tactic structure of the decoded trees. The leftmost
tree in Figure 4 is a decoded tree from the syn-nozf
system. Surprisingly, even though the decoded sen-
tence is passable English, the tree structure is totally
unnatural. We assumed that a good parse tree gives
high trigram probabilities. But it seems a bad parse
tree may give good trigram probabilities too. We
also noticed that too many unary rules (e.g. "NPB
→ PRN") were used. This is because the reordering
probability is always 1.
To remedy this, we added CFG probabilities
(PCFG) in the decoder search, i.e., it now looks for a
tree which maximizes P(trigram) P(cfg) P(TM). The
CFG probability was obtained by counting the rule
Figure 3: Top-20 frequent phrase translations in the Viterbi alignment
frequency in the parsed English side of the train-
ing corpus. The middle of Figure 4 is the output
for the same sentence. The syntactic structure now
looks better, but we found three problems. First, the
BLEU score is worse (0.078). Second, the decoded
trees seem to prefer noun phrases. In many trees, an
entire sentence was decoded as a large noun phrase.
Third, it uses node reordering more frequently than
it should.
The BLEU score may have gone down because we
weighted the LM (trigram and PCFG) more heavily than
the TM. For the problem of too many noun phrases, we
thought it was a problem with the corpus. Our train-
ing corpus contained many dictionary entries, and
the parliament transcripts also included a list of par-
ticipants’ names. This may cause the LM to prefer
noun phrases too much. Also our corpus contains
noise. There are two types of noise. One is sentence
alignment error, and the other is English parse error.
The corpus was sentence-aligned by automatic soft-
ware, so it has some bad alignments. When a sen-
tence was misaligned, or the parse was wrong, the
Viterbi alignment becomes an over-reordered tree as
it picks up plausible translation word pairs first and
reorders trees to fit them.
To see if it was really a corpus problem, we se-
lected a good portion of the corpus and re-trained
the r-table. To find good pairs of sentences in the
corpus, we used the following: 1) Both English and
Chinese sentences end with a period. 2) The En-
glish sentence begins with a capitalized word. 3) The
sentences do not contain symbol characters, such as
colons or dashes, which tend to cause parse errors. 4)
The Viterbi-ratio8 is more than the average of the
pairs which satisfied the first three conditions.
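A rough sketch of such a filter is given below; the field names, the Chinese sentence-final punctuation check, and the symbol pattern are assumptions made for the illustration, not the exact conditions used.

    import re

    # Sketch of the sentence-pair selection heuristics listed above.
    def select_pair(pair, avg_viterbi_ratio):
        en, zh = pair["english"], pair["chinese"]
        return (en.endswith(".") and zh.endswith("。")          # 1) both end with a period
                and en[0].isupper()                              # 2) capitalized English start
                and not re.search(r"[:\-]", en)                  # 3) no colons/dashes
                and pair["viterbi_ratio"] > avg_viterbi_ratio)   # 4) above-average Viterbi-ratio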
Using the selected sentence pairs, we retrained
only the r-table and the PCFG. The rightmost tree
in Figure 4 is the decoded tree using the re-trained
TM. The BLEU score was improved (0.085), and
the tree structure looks better, though there are still
problems. An obvious problem is that the goodness
of syntactic structure depends on the lexical choices.
For example, the best syntactic structure differs de-
pending on whether or not a verb requires a noun
phrase as its object. The PCFG-based LM does not handle
this.
At this point, we gave up using the PCFG as a
component of the LM. Using only trigrams obtains
the best result for the BLEU score. However, the
BLEU metric may not be affected by the syntac-
tic aspect of translation quality, and as we saw in
Figure 4, we can improve the syntactic quality by
introducing the PCFG using some corpus selection
techniques. Also, the pruning methods described in
Section 5 use syntactic statistics from the training
corpus. Therefore, we are now investigating more
sophisticated LMs such as (Charniak, 2001) which
8Viterbi-ratio is the ratio of the probability of the most plau-
sible alignment to the sum of the probabilities of all the align-
ments. Low Viterbi-ratio is a good indicator of misalignment or
parse error.
Figure 4: Effect of PCFG and re-training: No CFG probability (PCFG) was used (left). PCFG was used for
the search (middle). The r-table was re-trained and PCFG was used (right). Each tree was back-reordered
and is shown in the English order.
incorporate syntactic features and lexical informa-
tion.
8 Conclusion
We have presented a decoding algorithm for
syntax-based statistical machine translation. The
translation model was extended to incorporate
phrasal translations. Because the input of the chan-
nel model is an English parse tree, the decoding al-
gorithm is based on conventional syntactic parsing,
and the grammar is expanded by the channel oper-
ations of the TM. As the model size becomes huge
in a practical setting, and the decoder considers mul-
tiple syntactic structures for a word alignment, effi-
cient pruning is necessary. We applied several prun-
ing techniques and obtained good decoding quality
and coverage. The choice of the LM is an impor-
tant issue in implementing a decoder for the syntax-
based TM. At present, the best result is obtained by
using trigrams, but a more sophisticated LM seems
promising.
Acknowledgments
This work was supported by DARPA-ITO grant
N66001-00-1-9814.
References
H. Alshawi, S. Bangalore, and S. Douglas. 2000. Learn-
ing dependency translation models as collections of fi-
nite state head transducers. Computational Linguis-
tics, 26(1).
A. Berger, P. Brown, S. Della Pietra, V. Della Pietra,
J. Gillett, J. Lafferty, R. Mercer, H. Printz, and L. Ures.
1996. Language Translation Apparatus and Method
Using Context-Based Translation Models. U.S. Patent
5,510,981.
P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer.
1993. The mathematics of statistical machine trans-
lation: Parameter estimation. Computational Linguis-
tics, 19(2).
E. Charniak. 2001. Immediate-head parsing for language
models. In ACL-01.
M. Collins. 1999. Head-Driven Statistical Models for
Natural Language Parsing. Ph.D. thesis, University
of Pennsylvania.
I. Langkilde. 2000. Forest-based statistical sentence gen-
eration. In NAACL-00.
F. Och and H. Ney. 2000. Improved statistical alignment
models. In ACL-2000.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002.
BLEU: a method for automatic evaluation of machine
translation. In ACL-02.
D. Wu. 1997. Stochastic inversion transduction gram-
mars and bilingual parsing of parallel corpora. Com-
putational Linguistics, 23(3).
K. Yamada and K. Knight. 2001. A syntax-based statis-
tical translation model. In ACL-01.
K. Yamada. 2002. A Syntax-Based Statistical Transla-
tion Model. Ph.D. thesis, University of Southern Cali-
fornia.
