Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 961–968,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Scalable Inference and Training of
Context-Rich Syntactic Translation Models
Michel Galley*, Jonathan Graehl†, Kevin Knight†‡, Daniel Marcu†‡,
Steve DeNeefe†, Wei Wang‡ and Ignacio Thayer†
*Columbia University
Dept. of Computer Science
New York, NY 10027
galley@cs.columbia.edu, {graehl,knight,marcu,sdeneefe}@isi.edu,
wwang@languageweaver.com, thayer@google.com
†University of Southern California
Information Sciences Institute
Marina del Rey, CA 90292
‡Language Weaver, Inc.
4640 Admiralty Way
Marina del Rey, CA 90292
Abstract
Statistical MT has made great progress in the last
few years, but current translation models are weak
on re-ordering and target language fluency. Syn-
tactic approaches seek to remedy these problems.
In this paper, we take the framework for acquir-
ing multi-level syntactic translation rules of (Gal-
ley et al., 2004) from aligned tree-string pairs, and
present two main extensions of their approach: first,
instead of merely computing a single derivation that
minimally explains a sentence pair, we construct
a large number of derivations that include contex-
tually richer rules, and account for multiple inter-
pretations of unaligned words. Second, we pro-
pose probability estimates and a training procedure
for weighting these rules. We contrast different
approaches on real examples, show that our esti-
mates based on multiple derivations favor phrasal
re-orderings that are linguistically better motivated,
and establish that our larger rules provide a 3.63
BLEU point increase over minimal rules.
1 Introduction
While syntactic approaches seek to remedy word-
ordering problems common to statistical machine
translation (SMT) systems, many of the earlier
models—particularly child re-ordering models—
fail to account for human translation behavior.
Galley et al. (2004) alleviate this modeling prob-
lem and present a method for acquiring millions
of syntactic transfer rules from bilingual corpora,
which we review below. Here, we make the fol-
lowing new contributions: (1) we show how to
acquire larger rules that crucially condition on
more syntactic context, and show how to com-
pute multiple derivations for each training exam-
ple, capturing both large and small rules, as well
as multiple interpretations for unaligned words;
(2) we develop probability models for these multi-
level transfer rules, and give estimation methods
for assigning probabilities to very large rule sets.
We contrast our work with (Galley et al., 2004),
highlight some severe limitations of probability
estimates computed from single derivations, and
demonstrate that it is critical to account for many
derivations for each sentence pair. We also use
real examples to show that our probability mod-
els estimated from a large number of derivations
favor phrasal re-orderings that are linguistically
well motivated. An empirical evaluation against
a state-of-the-art SMT system similar to (Och and
Ney, 2004) indicates positive prospects. Finally,
we show that our contextually richer rules provide
a 3.63 BLEU point increase over those of (Galley
et al., 2004).
2 Inferring syntactic transformations
We assume we are given a source-language (e.g.,
French) sentence f, a target-language (e.g., En-
glish) parse tree pi, whose yield e is a translation
of f, and a word alignment a between f and e.
Our aim is to gain insight into the process of trans-
forming pi into f and to discover grammatically-
grounded translation rules. For this, we need
a formalism that is expressive enough to deal
with cases of syntactic divergence between source
and target languages (Fox, 2002): for any given
(pi,f,a) triple, it is useful to produce a derivation
that minimally explains the transformation be-
tween pi and f, while remaining consistent with a.
Galley et al. (2004) present one such formalism
(henceforth “GHKM”).
2.1 Tree-to-string alignments
It is appealing to model the transformation of pi
into f using tree-to-string (xRs) transducers, since
their theory has been worked out in an exten-
sive literature and is well understood (see, e.g.,
(Graehl and Knight, 2004)). Formally, transfor-
mational rules ri presented in (Galley et al., 2004)
are equivalent to 1-state xRs transducers mapping
a given pattern (subtree to match in pi) to a right
hand side string. We will refer to them as lhs(ri)
and rhs(ri), respectively. For example, some xRs
rules may describe the transformation of does not
into ne ... pas in French. A particular instance may
look like this:
VP(AUX(does), RB(not), x0:VB) → ne, x0, pas
lhs(ri) can be any arbitrary syntax tree fragment.
Its leaves are either lexicalized (e.g. does) or vari-
ables (x0, x1, etc). rhs(ri) is represented as a se-
quence of target-language words and variables.
Now we give a brief overview of how such
transformational rules are acquired automatically
in GHKM.1 In Figure 1, the (pi,f,a) triple is rep-
resented as a directed graph G (edges going down-
ward), with no distinction between edges of pi and
alignments. Each node of the graph is labeled with
its span and complement span (the latter in italic
in the figure). The span of a node n is defined by
the indices of the first and last word in f that are
reachable from n. The complement span of n is
the union of the spans of all nodes n′ in G that
are neither descendants nor ancestors of n. Nodes
of G whose spans and complement spans are non-
overlapping form the frontier set F ∈ G.
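As a minimal sketch of these definitions (not the GHKM implementation itself), the following Python computes spans, complement spans, and the frontier set for a toy example. The sentence pair "he does not go" / "il ne va pas", its alignment, and the `Node` class are illustrative assumptions. The sketch also assumes that complement spans are built from the aligned source positions themselves, while a node's own span is the closed range between its first and last reachable position.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    aligned: set = field(default_factory=set)  # source positions aligned to this word
    reach: set = field(default_factory=set)    # source positions reachable from the node
    comp: set = field(default_factory=set)     # complement span

def assign_spans(n):
    """Bottom-up: a node reaches whatever its own word and its children reach."""
    n.reach = set(n.aligned)
    for c in n.children:
        n.reach |= assign_spans(c)
    return n.reach

def assign_complements(n, inherited=frozenset()):
    """Top-down: a child inherits its parent's complement span plus its siblings' spans."""
    n.comp = set(inherited)
    for c in n.children:
        siblings = set().union(*(s.reach for s in n.children if s is not c))
        assign_complements(c, n.comp | siblings)

def frontier(n):
    """Labels of nodes whose span [first, last] holds no position of their complement span."""
    out = set()
    if n.reach and not (set(range(min(n.reach), max(n.reach) + 1)) & n.comp):
        out.add(n.label)
    for c in n.children:
        out |= frontier(c)
    return out

# Toy (pi, f, a) triple; source positions: il=1, ne=2, va=3, pas=4 (assumed example).
tree = Node("S", [
    Node("NP", [Node("PRP", aligned={1})]),   # he   -> il
    Node("VP", [
        Node("AUX"),                          # does -> unaligned
        Node("RB", aligned={2, 4}),           # not  -> ne ... pas
        Node("VB", aligned={3}),              # go   -> va
    ]),
])
assign_spans(tree)
assign_complements(tree)
print(sorted(frontier(tree)))  # ['NP', 'PRP', 'S', 'VB', 'VP']
```

In this toy graph RB is excluded because its span [2-4] contains position 3, which belongs to its complement span; AUX is excluded because it has no span at all.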
What is particularly interesting about the fron-
tier set? For any frontier of graph G containing
a given node n ∈ F, spans on that frontier de-
fine an ordering between n and each other frontier
node n′. For example, the span of VP[4-5] either
precedes or follows, but never overlaps the span of
any node n′ on any graph frontier. This property
does not hold for nodes outside of F. For instance,
PP[4-5] and VBG[4] are two nodes of the same
graph frontier, but they cannot be ordered because
of their overlapping spans.
The purpose of xRs rules in this framework is
to order constituents along sensible frontiers in G,
and all frontiers containing undefined orderings,
as between PP[4-5] and VBG[4], must be disre-
garded during rule extraction. To ensure that xRs
rules are prevented from attempting to re-order
any such pair of constituents, these rules are de-
signed in such a way that variables in their lhs can
only match nodes of the frontier set. Rules that
satisfy this property are said to be induced by G.2
For example, rule (d) in Table 1 is valid accord-
ing to GHKM, since the spans corresponding to
1 Note that we use a slightly different terminology.
2 Specifically, an xRs rule ri is extracted from G by taking
a subtree γ ∈ pi as lhs(ri), appending a variable to each
leaf node of γ that is internal to pi, adding those variables to
rhs(ri), ordering them in accordance with a, and if necessary
inserting any word of f to ensure that rhs(ri) is a sequence of
contiguous spans (e.g., [4-5][6][7-8] for rule (f) in Table 1).
Figure 1: Spans and complement-spans determine what
rules are extracted. Constituents in gray are members of the
frontier set; a minimal rule is extracted from each of them.
(a) S(x0:NP, x1:VP, x2:.) → x0, x1, x2
(b) NP(x0:DT, CD(7), NNS(people)) → x0, 7a186
(c) DT(these) →a217
(d) VP(x0:VBP, x1:NP) → x0, x1
(e) VBP(include) →a45a5a236
(f) NP(x0:NP, x1:VP) → x1,a132, x0
(g) NP(x0:NNS) → x0
(h) NNS(astronauts) →a135a42,a88
(i) VP(VBG(coming), PP(IN(from), x0:NP)) →a101a234, x0
(j) NP(x0:NNP) → x0
(k) NNP(France) →a213a253
(l) .(.) → .
Table 1: A minimal derivation corresponding to Figure 1.
its rhs constituents (VBP[3] and NP[4-8]) do not
overlap. Conversely, NP(x0:DT, x1:CD, x2:NNS)
is not the lhs of any rule extractible from G, since
its frontier constituents CD[2] and NNS[2] have
overlapping spans.3 Finally, the GHKM proce-
dure produces a single derivation from G, which
is shown in Table 1.
The concern in GHKM was to extract minimal
rules, whereas ours is to extract rules of any arbi-
trary size. Minimal rules defined over G are those
that cannot be decomposed into simpler rules in-
duced by the same graph G, e.g., all rules in Ta-
ble 1. We call minimal a derivation that only con-
tains minimal rules. Conversely, a composed rule
results from the composition of two or more minimal
rules, e.g., rules (b) and (c) compose into:
NP(DT(these), CD(7), NNS(people)) →a217, 7a186
3It is generally reasonable to also require that the root n
of lhs(ri) be part of F, because no rule induced by G can
compose with ri at n, due to the restrictions imposed on the
extraction procedure, and ri wouldn’t be part of any valid
derivation.
Figure 2: (a) Multiple ways of aligning a132 to constituents in the tree. (b) Derivation corresponding to the parse tree in Figure 1,
which takes into account all alignments of a132 pictured in (a).
Note that these properties are dependent on G, and
the above rule would be considered a minimal rule
in a graph G′ similar to G, but additionally containing
a word alignment between 7 and a217. We
will see in Sections 3 and 5 why extracting only
minimal rules can be highly problematic.
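The composition of rules described above can be sketched as follows. This is an illustration under our own assumptions rather than the extraction code used in the paper: the `Var`, `Tree`, and `Rule` classes and the `compose` helper are hypothetical names, and the rule VB(go) → va is a toy rule added to complete the does-not example of Section 2.1. The second example composes rules (a) and (d) of Table 1.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Var:
    name: str    # e.g. "x0"
    label: str   # syntactic category the variable must match

@dataclass(frozen=True)
class Tree:
    label: str
    children: Tuple[Union["Tree", Var, str], ...]  # subtrees, variables, or words

@dataclass(frozen=True)
class Rule:
    lhs: Tree
    rhs: Tuple[Union[Var, str], ...]

def lhs_vars(t):
    """Variables of a tree fragment, left to right."""
    out = []
    for c in t.children:
        if isinstance(c, Var):
            out.append(c)
        elif isinstance(c, Tree):
            out.extend(lhs_vars(c))
    return out

def rename(rule, mapping):
    """Apply a variable renaming to both sides of a rule."""
    def ren(t):
        return Tree(t.label, tuple(
            mapping[c] if isinstance(c, Var) else ren(c) if isinstance(c, Tree) else c
            for c in t.children))
    return Rule(ren(rule.lhs), tuple(mapping[s] if isinstance(s, Var) else s for s in rule.rhs))

def substitute(t, var, subtree):
    """Replace the leaf `var` of `t` by `subtree`."""
    return Tree(t.label, tuple(
        subtree if c == var else substitute(c, var, subtree) if isinstance(c, Tree) else c
        for c in t.children))

def compose(outer, var, inner):
    """Plug `inner` into variable `var` of `outer`, on both the lhs and the rhs."""
    assert inner.lhs.label == var.label
    # Keep the two variable namespaces apart while splicing.
    outer = rename(outer, {v: Var("o" + v.name, v.label) for v in lhs_vars(outer.lhs)})
    inner = rename(inner, {v: Var("i" + v.name, v.label) for v in lhs_vars(inner.lhs)})
    var = Var("o" + var.name, var.label)
    lhs = substitute(outer.lhs, var, inner.lhs)
    rhs = tuple(x for s in outer.rhs for x in (inner.rhs if s == var else (s,)))
    # Renumber variables x0, x1, ... in left-to-right lhs order.
    fresh = {v: Var(f"x{i}", v.label) for i, v in enumerate(lhs_vars(lhs))}
    return rename(Rule(lhs, rhs), fresh)

x0 = Var("x0", "VB")
does_not = Rule(Tree("VP", (Tree("AUX", ("does",)), Tree("RB", ("not",)), x0)), ("ne", x0, "pas"))
go = Rule(Tree("VB", ("go",)), ("va",))          # toy rule, not from the paper
does_not_go = compose(does_not, x0, go)          # VP(AUX(does), RB(not), VB(go)) -> ne, va, pas

a0, a1, a2 = Var("x0", "NP"), Var("x1", "VP"), Var("x2", ".")
rule_a = Rule(Tree("S", (a0, a1, a2)), (a0, a1, a2))
d0, d1 = Var("x0", "VBP"), Var("x1", "NP")
rule_d = Rule(Tree("VP", (d0, d1)), (d0, d1))
rule_ad = compose(rule_a, a1, rule_d)            # S(x0:NP, VP(x1:VBP, x2:NP), x3:.) -> x0, x1, x2, x3
```

Renumbering the variables after splicing keeps composed rules in a canonical form, so that the same composed rule obtained from different sentence pairs is counted as one rule.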
2.2 Unaligned words
While the general theory presented in GHKM ac-
counts for any kind of derivation consistent with
G, it does not particularly discuss the case where
some words of the source-language string f are
not aligned to any word of e, thus disconnected
from the rest of the graph. This case is highly fre-
quent: 24.1% of Chinese words in our 179 mil-
lion word English-Chinese bilingual corpus are
unaligned, and 84.8% of Chinese sentences con-
tain at least one unaligned word. The question is
what to do with such lexical items, e.g., a132 in
Figure 2(a). The approach of building one mini-
mal derivation for G as in the algorithm described
in GHKM assumes that we commit ourselves to
a particular heuristic to attach the unaligned item
to a certain constituent of pi, e.g., highest attach-
ment (in the example, a132 is attached to NP[4-8]
and the heuristic generates rule (f)). A more rea-
sonable approach is to invoke the principle of in-
sufficient reason and make no a priori assump-
tion about what is a “correct” way of assigning
the item to a constituent, and return all derivations
that are consistent with G. In Section 4, we will
see how to use corpus evidence to give preference
to unaligned-word attachments that are the most
consistent across the data. Figure 2(a) shows the
six possible ways of attaching a132 to constituents
of pi: besides the highest attachment (rule (f)), a132
can move along the ancestors of France, since it is
to the right of the translation of that word, and be
considered to be part of an NNP, NP, or VP rule.
The same reasoning applies to the left: a132 can
either start the NNS of astronauts, or start an NP.
Our account of all possible ways of consistently
attaching a132 to constituents means we must ex-
tract more than one derivation to explain transfor-
mations in G, even if we still restrict ourselves to
minimal derivations (a minimal derivation for G
is unique if and only if no source-language word
in G is unaligned). While we could enumerate
all derivations separately, it is much more effi-
cient both in time and space to represent them as a
derivation forest, as in Figure 2(b). Here, the for-
est covers all minimal derivations that correspond
to G. It is necessary to ensure that for each deriva-
tion, each unaligned item (here a132) appears only
once in the rules of that derivation, as shown in
Figure 2 (which satisfies the property). That re-
quirement will prove to be critical when we ad-
dress the problem of estimating probabilities for
our rules: if, in our example, we allowed a132 to be
spuriously generated in multiple successive steps of
the same derivation, we would not only represent
the transformation incorrectly, but a132-rules
would also be disproportionately represented, leading
to strongly biased estimates. We will now see how
to ensure this constraint is satisfied in our rule ex-
traction and derivation building algorithm.
2.3 Algorithm
The linear-time algorithm presented in GHKM is
only a particular case of the more general one we
describe here, which is used to extract all rules,
minimal and composed, induced by G. Similarly
to the GHKM algorithm, ours performs a top-
down traversal of G, but differs in the operations
it performs at each node n ∈ F: we must explore
all subtrees rooted at n, find all consistent ways
of attaching unaligned words of f, and build valid
derivations in accordance with these attachments.
We use a table or-dforest[x,y,c] to store OR-
nodes, in which each OR-node can be uniquely
defined by a syntactic category c and a span [x,y]
(which may cover unaligned words of f). This table
is used to prevent the same partial derivation
from being followed multiple times (the in-degrees of
OR-nodes generally become large with composed
rules). Furthermore, to avoid over-generating un-
aligned words, the root and variables in each rule
are represented with their spans. For example, in
Figure 2(b), the second and third child of the top-
most OR-node respectively span across [4-5][6-8]
and [4-6][7-8] (after constituent reordering). In
the former case, a132 will eventually be realized in
an NP, and in the latter case, in a VP.
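The sharing that the or-dforest table provides can be illustrated with a small sketch. The forest below is hypothetical (its rule names and spans are made up for illustration, and it is not the forest of Figure 2); the point is only that OR-nodes keyed by (x, y, c) are stored once and reused, so the number of derivations encoded can grow much faster than the number of stored nodes.

```python
from functools import lru_cache
from math import prod

# Hypothetical forest: OR-node key (x, y, c) -> list of AND-nodes, each AND-node
# being (rule name, keys of the OR-nodes that its variables lead to).
or_dforest = {
    (1, 4, "S"):  [("S_min",  [(1, 1, "NP"), (2, 4, "VP")]),
                   ("S_comp", [(2, 4, "VP")])],   # S rule already composed with the NP rule
    (1, 1, "NP"): [("NP_min", [])],
    (2, 4, "VP"): [("VP_min",  [(3, 3, "VB")]),
                   ("VP_comp", [])],              # VP rule already composed with the VB rule
    (3, 3, "VB"): [("VB_min", [])],
}

@lru_cache(maxsize=None)
def count_derivations(key):
    """Sum over the alternative rules of an OR-node; product over each rule's variables."""
    return sum(prod(count_derivations(k) for k in kids) for _, kids in or_dforest[key])

print(count_derivations((1, 4, "S")))  # 4
```

The four derivations here are encoded with four OR-nodes and six rule nodes; the memoized recursion visits each OR-node once, which mirrors why a table lookup prevents the same partial derivation from being followed repeatedly.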
The preprocessing step consists of assigning
spans and complement spans to nodes of G, in
the first case by a bottom-up exploration of the
graph, and in the latter by a top-down traversal.
To assign complement spans, we assign the com-
plement span of any node n to each of its children,
and for each of them, add the span of the child
to the complement span of all other children. In
another traversal of G, we determine the minimal
rule extractible from each node in F.
We explore all tree fragments rooted at n by
maintaining an open and a closed queue of rules
extracted from n (qo and qc). At each step, we
pick the smallest rule in qo, and for each of its
variable nodes, try to discover new rules (‘succes-
sor rules’) by means of composition with minimal
rules, until a given threshold on rule size or maximum
number of rules in qc is reached. There may
be more than one successor per rule, since we must
account for all possible spans that can be assigned
to non-lexical leaves of a rule. Once a threshold is
reached, or if the open queue is empty, we connect
a new OR-node to all rules that have just been ex-
tracted from n, and add it to or-dforest. Finally,
we proceed recursively, and extract new rules from
each node at the frontier of the minimal rule rooted
at n. Once all nodes of F have been processed, the
or-dforest table contains a representation encod-
ing only valid derivations.
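The open/closed queue exploration described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `MINIMAL` table is a hypothetical set of minimal rules, rule size is assumed to be the number of internal lhs nodes, and the sketch ignores unaligned words (so each composition yields exactly one successor).

```python
import heapq

# Hypothetical minimal rules, one per frontier node:
# node -> (number of internal lhs nodes, frontier nodes bound to its variables)
MINIMAL = {
    "S":  (1, ["NP", "VP"]),
    "NP": (1, []),
    "VP": (2, ["VB"]),
    "VB": (1, []),
}

def rules_rooted_at(n, max_size, max_rules=1000):
    """Enumerate minimal and composed rules rooted at n, smallest first.

    A rule is identified by the sorted tuple of frontier nodes whose
    minimal rules were composed to build it.
    """
    start = (MINIMAL[n][0], (n,))
    q_open, seen, q_closed = [start], {start[1]}, []
    while q_open and len(q_closed) < max_rules:
        size, nodes = heapq.heappop(q_open)        # smallest rule in the open queue
        q_closed.append((size, nodes))
        variables = [v for m in nodes for v in MINIMAL[m][1] if v not in nodes]
        for v in variables:                        # compose with the minimal rule at v
            succ = tuple(sorted(nodes + (v,)))
            succ_size = size + MINIMAL[v][0]
            if succ_size <= max_size and succ not in seen:
                seen.add(succ)
                heapq.heappush(q_open, (succ_size, succ))
    return q_closed

for size, nodes in rules_rooted_at("S", max_size=4):
    print(size, nodes)
```

With `max_size=4`, the sketch returns five rules rooted at S; the fully composed rule covering all four nodes has size 5 and is pruned by the threshold.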
3 Probability models
The overall goal of our translation system is to
transform a given source-language sentence f
into an appropriate translation e in the set E
of all possible target-language sentences. In a
noisy-channel approach to SMT, we use Bayes’
theorem and choose the English sentence ˆe ∈ E
that maximizes:4
\[ \hat{e} = \operatorname*{argmax}_{e \in E} \Big\{ \Pr(e) \cdot \Pr(f \mid e) \Big\} \tag{1} \]
Pr(e) is our language model, and Pr(f|e) our
translation model. In a grammatical approach to
MT, we hypothesize that syntactic information
can help produce good translation, and thus
introduce dependencies on target-language syntax
trees. The function to optimize becomes:
\[ \hat{e} = \operatorname*{argmax}_{e \in E} \Big\{ \Pr(e) \cdot \sum_{\pi \in \tau(e)} \Pr(f \mid \pi) \cdot \Pr(\pi \mid e) \Big\} \tag{2} \]
τ(e) is the set of all English trees that yield the
given sentence e. Estimating Pr(pi|e) is a prob-
lem equivalent to syntactic parsing and thus is not
discussed here. Estimating Pr(f|pi) is the task of
syntax-based translation models (SBTM).
Given a rule set R, our SBTM makes the
common assumption that left-most compositions
of xRs rules θi = r1 ◦ ... ◦ rn are independent
from one another in a given derivation θi ∈ Θ,
where Θ is the set of all derivations constructible
from G = (pi,f,a) using rules of R. Assuming
that Λ is the set of all subtree decompositions of pi
corresponding to derivations in Θ, we define the
estimate:
\[ \Pr(f \mid \pi) = \frac{1}{|\Lambda|} \sum_{\theta_i \in \Theta} \prod_{r_j \in \theta_i} p(\mathrm{rhs}(r_j) \mid \mathrm{lhs}(r_j)) \tag{3} \]
under the assumption:
\[ \sum_{r_j \in R:\, \mathrm{lhs}(r_j) = \mathrm{lhs}(r_i)} p(\mathrm{rhs}(r_j) \mid \mathrm{lhs}(r_j)) = 1 \tag{4} \]
It is important to notice that the probability
distribution defined in Equation 3 requires a
normalization factor (|Λ|) in order to be tight, i.e.,
sum to 1 over all strings fi ∈ F that can be derived
4 We denote general probability distributions with Pr(·)
and use p(·) for probabilities assigned by our models.
Figure 3: Example corpus, with two training triples (pi,f1,a1) and (pi,f2,a2).
from pi. A simple example suffices to demonstrate
it is not tight without normalization. Figure 3
contains a sample corpus from which four rules
can be extracted:
r1: X(a, Y(b, c)) → a’, b’, c’
r2: X(a, Y(b, c)) → b’, a’, c’
r3: X(a, x0:Y) → a’, x0
r4: Y(b, c) → b’, c’
From Equation 4, the probabilities of r3 and r4
must be 1, and those of r1 and r2 must sum to
1. Thus, the total probability mass, which is dis-
tributed across two possible output strings a’b’c’
and b’a’c’, is: p(a’b’c’|pi) + p(b’a’c’|pi) = p1 +
p3 · p4 + p2 = 2, where pi = p(rhs(ri)|lhs(ri)).
It is relatively easy to prove that the probabil-
ities of all derivations that correspond to a given
decomposition λi ∈ Λ sum to 1 (the proof is omit-
ted due to constraints on space). From this prop-
erty we can immediately conclude that the model
described by Equation 3 is tight.5
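To make the role of the normalization factor concrete, the example of Figure 3 can be worked out one decomposition at a time. The names λ1 (the whole tree taken as a single fragment) and λ2 (the fragment X(a, x0:Y) followed by Y(b, c)) are our own labels for the two decompositions in Λ.

```latex
\begin{align*}
\lambda_1:&\quad p_1 + p_2 = 1 && \text{(rules } r_1 \text{ and } r_2 \text{ share one lhs)}\\
\lambda_2:&\quad p_3 \cdot p_4 = 1 \cdot 1 = 1 && \text{(derivation } r_3 \circ r_4\text{)}\\
\sum_{f} \Pr(f \mid \pi) &= \frac{1}{|\Lambda|}\,\bigl[(p_1 + p_2) + p_3 \cdot p_4\bigr] = \frac{1}{2}\,(1 + 1) = 1
\end{align*}
```

Each decomposition contributes a total mass of 1, so dividing by the number of decompositions, here 2, restores a proper distribution over output strings.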
We examine two estimates p(rhs(r)|lhs(r)).
The first one is the relative frequency estimator
conditioning on left hand sides:
\[ p(\mathrm{rhs}(r) \mid \mathrm{lhs}(r)) = \frac{f(r)}{\sum_{r':\, \mathrm{lhs}(r') = \mathrm{lhs}(r)} f(r')} \tag{5} \]
f(r) represents the number of times rule r oc-
curred in the derivations of the training corpus.
One of the major negative consequences of
extracting only minimal rules from a corpus is
that an estimator such as Equation 5 can become
extremely biased. This again can be observed
from Figure 3. In the minimal-rule extraction of
GHKM, only three rules are extracted from the ex-
ample corpus, i.e. rules r2, r3, and r4. Let’s as-
sume now that the triple (pi,f1,a1) is represented
99 times, and (pi,f2,a2) only once. Given a tree
pi, the model trained on that corpus can generate
the two strings a’b’c’ and b’a’c’ only through two
derivations, r3 ◦ r4 and r2, respectively. Since
all rules in that example have probability 1, and
5 If each tree fragment in pi is the lhs of some rule in R,
then we have |Λ| = 2^n, where n is the number of nodes of
the frontier set F ∈ G (each node is a binary choice point).
given that the normalization factor |Λ| is 2, both
probabilities p(a’b’c’|pi) and p(b’a’c’|pi) are 0.5.
On the other hand, if all rules are extracted and
incorporated into our relative-frequency probabil-
ity model, r1 seriously counterbalances r2 and the
probability of a’b’c’ becomes: $\frac{1}{2} \cdot (\frac{99}{100} + 1) = .995$
(since it differs from .99, the estimator remains bi-
ased, but to a much lesser extent).
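The two figures discussed above (0.5 with minimal rules only, .995 with all rules) can be reproduced with a short sketch of Equations 3 and 5 on the example corpus. The dictionaries and function names below are our own; the rule inventory, the 99-to-1 corpus, and |Λ| = 2 are taken from the text, and we assume f(r) is incremented once for every derivation that contains r.

```python
from collections import Counter
from math import prod

# lhs of each rule of the example corpus (Figure 3)
LHS = {"r1": "X(a, Y(b, c))", "r2": "X(a, Y(b, c))", "r3": "X(a, x0:Y)", "r4": "Y(b, c)"}
# Derivations that produce each output string
DERIVS = {"a' b' c'": [("r1",), ("r3", "r4")], "b' a' c'": [("r2",)]}
CORPUS = {"a' b' c'": 99, "b' a' c'": 1}   # (pi,f1,a1) seen 99 times, (pi,f2,a2) once
N_DECOMP = 2                               # |Lambda| for this tree

def count_rules(allowed):
    """f(r): occurrences of r in the derivations built from the allowed rules."""
    counts = Counter()
    for f, n in CORPUS.items():
        for d in DERIVS[f]:
            if all(r in allowed for r in d):
                for r in d:
                    counts[r] += n
    return counts

def relative_frequency(counts):
    """Equation 5: normalize counts over rules sharing the same lhs."""
    totals = Counter()
    for r, c in counts.items():
        totals[LHS[r]] += c
    return {r: c / totals[LHS[r]] for r, c in counts.items()}

def prob(f, p):
    """Equation 3: normalized sum over derivations of the product of rule probabilities."""
    usable = [d for d in DERIVS[f] if all(r in p for r in d)]
    return sum(prod(p[r] for r in d) for d in usable) / N_DECOMP

p_min = relative_frequency(count_rules({"r2", "r3", "r4"}))   # minimal rules only
p_all = relative_frequency(count_rules(set(LHS)))             # minimal and composed rules
print(prob("a' b' c'", p_min), prob("b' a' c'", p_min))
print(prob("a' b' c'", p_all), prob("b' a' c'", p_all))
```

With minimal rules only, both strings receive probability 0.5 despite the 99-to-1 imbalance in the data; with all rules, the estimates move to roughly .995 and .005.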
An alternative to the conditional model of
Equation 3 is to use a joint model conditioning on
the root node instead of the entire left hand side:
\[ p(r \mid \mathrm{root}(r)) = \frac{f(r)}{\sum_{r':\, \mathrm{root}(r') = \mathrm{root}(r)} f(r')} \tag{6} \]
This can be particularly useful if no parser or
syntax-based language model is available, and we
need to rely on the translation model to penalize
ill-formed parse trees. Section 6 will describe an
empirical evaluation based on this estimate.
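As a small illustration of Equation 6, the sketch below applies root normalization to the four rules of the Figure 3 example. The counts are an assumption on our part (all four rules extracted from 99 copies of the first triple and one copy of the second), and the names `ROOT`, `COUNTS`, and `root_normalized` are hypothetical.

```python
from collections import Counter

ROOT = {"r1": "X", "r2": "X", "r3": "X", "r4": "Y"}      # root category of each rule's lhs
COUNTS = {"r1": 99, "r2": 1, "r3": 99, "r4": 99}         # assumed rule counts f(r)

def root_normalized(counts, root):
    """Equation 6: normalize counts over all rules sharing the same root."""
    totals = Counter()
    for r, c in counts.items():
        totals[root[r]] += c
    return {r: c / totals[root[r]] for r, c in counts.items()}

p = root_normalized(COUNTS, ROOT)
print(p)
```

Unlike lhs normalization, r3 now competes with r1 and r2 for the probability mass of root X, so the model also scores the choice of tree fragment and not only the choice of rhs.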
4 EM training
In our previous discussion of parameter estima-
tion, we did not explore the possibility that one
derivation in a forest may be much more plau-
sible than the others. If we knew which deriva-
tion in each forest was the “true” derivation, then
we could straightforwardly collect rule counts off
those derivations. On the other hand, if we had
good rule probabilities, we could compute the
most likely (Viterbi) derivations for each training
example. This is a situation in which we can employ
EM training, starting with uniform rule probabilities.
For each training example, we would like
to: (1) score each derivation θi as a product of the
probabilities of the rules it contains, (2) compute
a conditional probability pi for each derivation θi
(conditioned on the observed training pair) by nor-
malizing those scores to add to 1, and (3) collect
weighted counts for each rule in each θi, where
the weight is pi. We can then normalize the counts
to get refined probabilities, and iterate; the corpus
likelihood is guaranteed to improve with each it-
eration. While it is infeasible to enumerate the
millions of derivations in each forest, Graehl and
Knight (2004) demonstrate an efficient algorithm.
They also analyze how to train arbitrary tree transducers
in two steps. The first step is to build a
derivation forest for each training example, where
the forest contains those derivations licensed by
the (already supplied) transducer’s rules. The sec-
ond step employs EM on those derivation forests,
running in time proportional to the size of the
Best minimal-rule derivation (Cm) p(r)
(a) S(x0:NP-C x1:VP x2:.) → x0 x1 x2 .845
(b) NP-C(x0:NPB) → x0 .82
(c) NPB(DT(the) x0:NNS) → x0 .507
(d) NNS(gunmen) →a170a75 .559
(e) VP(VBD(were) x0:VP-C) → x0 .434
(f) VP-C(x0:VBN x1:PP) → x1 x0 .374
(g) PP(x0:IN x1:NP-C) → x0 x1 .64
(h) IN(by) →a171 .0067
(i) NP-C(x0:NPB) → x0 .82
(j) NPB(DT(the) x0:NN) → x0 .586
(k) NN(police) →a102a185 .0429
(l) VBN(killed) →a251a217 .0072
(m) .(.) → . .981
[Graph: parse tree of "The gunmen were killed by the police ." aligned with its Chinese translation.]
Best composed-rule derivation (C4) p(r)
(o) S(NP-C(NPB(DT(the) NNS(gunmen))) x0:VP .(.)) →a170a75x0 . 1
(p) VP(VBD(were) VP-C(x0:VBN PP(IN(by) x1:NP-C))) →a171x1 x0 0.00724
(q) NP-C(NPB(DT(the) NN(police))) →a102a185 0.173
(r) VBN(killed) →a251a217 0.00719
Figure 4: Two most probable derivations for the graph on the right: the top table restricted to minimal rules; the bottom one,
much more probable, using a large set of composed rules. Note: the derivations are constrained on the (pi,f,a) triple, and thus
include some non-literal translations with relatively low probabilities (e.g. killed, which is more commonly translated as a123a161).
rule set   nb. of rules   nb. of nodes   deriv-time   EM-time
Cm         4M             192M           2 h.         4 h.
C3         142M           1255M          52 h.        34 h.
C4         254M           2274M          134 h.       60 h.
Table 2: Rules and derivation nodes for a 54M-word, 1.95M
sentence pair English-Chinese corpus, and time to build
derivations (on 10 cluster nodes) and run 50 EM iterations.
forests. We only need to borrow the second step
for our present purposes, as we construct our own
derivation forests when we acquire our rule set.
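The three EM steps listed above can be sketched on explicitly enumerated derivations. This is only an illustration: it is not the forest-based algorithm of Graehl and Knight (2004), which avoids enumerating derivations, and the training data is the toy corpus of Figure 3 with its 99-to-1 split. The names `LHS`, `TRAINING`, `normalize`, and `em_step` are our own, and lhs normalization is assumed.

```python
import math
from collections import defaultdict

LHS = {"r1": "X(a, Y(b, c))", "r2": "X(a, Y(b, c))", "r3": "X(a, x0:Y)", "r4": "Y(b, c)"}
# Each training pair: (its derivations, number of times the pair occurs)
TRAINING = [([("r1",), ("r3", "r4")], 99), ([("r2",)], 1)]

def normalize(counts):
    """Turn (fractional) counts into probabilities, normalizing per lhs."""
    totals = defaultdict(float)
    for r, c in counts.items():
        totals[LHS[r]] += c
    return {r: c / totals[LHS[r]] for r, c in counts.items()}

def em_step(p):
    """One EM iteration; returns new probabilities and the log-likelihood under p."""
    counts = defaultdict(float)
    loglik = 0.0
    for derivs, n in TRAINING:
        scores = [math.prod(p[r] for r in d) for d in derivs]  # (1) score each derivation
        z = sum(scores)
        loglik += n * math.log(z)
        for d, s in zip(derivs, scores):
            for r in d:
                counts[r] += n * s / z                         # (2)+(3) posterior-weighted counts
    return normalize(counts), loglik

p = normalize({r: 1.0 for r in LHS})    # uniform start within each lhs
history = []
for _ in range(50):
    p, ll = em_step(p)
    history.append(ll)
print(p)
```

On this toy corpus the log-likelihood never decreases from one iteration to the next, and p(r1) settles near .98 rather than the .99 given by relative frequency, because r1 shares the evidence for a’b’c’ with the derivation r3 ◦ r4.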
A major challenge is to scale up this EM train-
ing to large data sets. We have been able to run
EM for 50 iterations on our Chinese-English 54-
million word corpus. The derivation forests for
this corpus contain 2.2 billion nodes; the largest
forest contains 1.1 million nodes. The outcome
is to assign probabilities to over 254 million rules.
Our EM runs with either lhs normalization or lhs-
root normalization. In the former case, each lhs
has an average of three corresponding rhs’s that
compete with each other for probability mass.
5 Model coverage
We now present some examples illustrating the
benefit of composed rules. We trained three
p(rhs(ri)|lhs(ri)) models on a 54 million-word
English-Chinese parallel corpus (Table 2): the first
one (Cm) with only minimal rules, and the two
others (C3 and C4) additionally considering com-
posed rules with no more than three, respectively
four, internal nodes in lhs(ri). We evaluated these
models on a section of the NIST 2002 evaluation
corpus, for which we built derivation forests and
lhs: S(x0:NP-C VP(x1:VBD x2:NP-C) x3:.)
corpus rhsi p(rhsi|lhs)
Chinese x1 x0 x2 x3 .3681
(minimal) x0 x1 a44x3 x2 .0357
x2 , x0 x1 x3 .0287
x0 x1 a44x3 x2 . .0267
Chinese x0 x1 x2 x3 .9047
(composed) x0 x1 , x2 x3 .016
x0 , x1 x2 x3 .0083
x0 x1 a134x2 x3 .0072
Arabic x1 x0 x2 x3 .5874
(composed) x0 x1 x2 x3 .4027
x1 x2 x0 x3 .0077
x1 x0 x2 " x3 .0001
Table 3: Our model transforms English subject-verb-object
(SVO) structures into Chinese SVO and into Arabic VSO.
With only minimal rules, Chinese VSO is wrongly preferred.
extracted the most probable one (Viterbi) for each
sentence pair (based on an automatic alignment
produced by GIZA). We noticed in general that
Viterbi derivations according to C4 make extensive
use of composed rules, as is the case in
the example in Figure 4. It shows the best deriva-
tion according to Cm and C4 on the unseen (pi,f,a)
triple displayed on the right. The second derivation
(log p = −11.6) is much more probable than
the minimal one (log p = −17.7). In the case
of Cm, we can see that many small rules must be
applied to explain the transformation, and at each
step, the decision regarding the re-ordering of constituents
is made with little syntactic context. For
example, from the perspective of a decoder, the
word by is immediately transformed into a prepo-
sition (IN), but it is in general useful to know
which particular function word is present in the
sentence to motivate good re-orderings in the up-
lhs1: NP-C(x0:NPB PP(IN(of) x1:NP-C)) (NP-of-NP)
lhs2: PP(IN(of) NP-C(x0:NPB PP(IN(of) NP-C(x1:NPB x2:VP)))) (of-NP-of-NP-VP)
lhs3: VP(VBD(said) SBAR-C(IN(that) x0:S-C)) (said-that-S)
lhs4: SBAR(WHADVP(WRB(when)) S-C(x0:NP-C VP(VBP(are) x1:VP-C))) (when-NP-are-VP)
rhs1i p(rhs1i|lhs1) rhs2i p(rhs2i|lhs2) rhs3i p(rhs3i|lhs3) rhs4i p(rhs4i|lhs4)
x1 x0 .54 x2 a132x1 a132x0 .6754 a244 , x0 .6062 a40x1 x0 a246 .6618
x0 x1 .2351 a40x2 a132x1 a132x0 .035 a244x0 .1073 a83x1 x0 a246 .0724
x1 a132x0 .0334 x2 a132x1 a132x0 , .0263 a104a58 , x0 .0591 a40x1 x0 a246 , .0579
x1 x0 a132 .026 x2 a132x1 a132x0 a9 .0116 a214a244 , x0 .0234 , a40x1 x0 a246 .0289
Table 4: Translation probabilities promote linguistically motivated constituent re-orderings (for lhs1 and lhs2), and enable
non-constituent (lhs3) and non-contiguous (lhs4) phrasal translations.
per levels of the tree. A rule like (e) is particu-
larly unfortunate, since it allows the word were to
be added without any other evidence that the VP
should be in passive voice. On the other hand, the
composed-rule derivation of C4 incorporates more
linguistic evidence in its rules, and re-orderings
are motivated by more syntactic context. Rule
(p) is particularly appropriate to create a passive
VP construct, since it expects a Chinese passive
marker (a171), an NP-C, and a verb in its rhs, and
creates the were ... by construction at once in the
left hand side.
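The two log-probabilities quoted for Figure 4 can be checked directly from the rule probabilities listed in the figure, scoring each derivation as the product of its rule probabilities. The sketch below assumes natural logarithms, which is the base that reproduces the reported values; the variable and function names are our own.

```python
import math

# Rule probabilities read off Figure 4
minimal = [.845, .82, .507, .559, .434, .374, .64,
           .0067, .82, .586, .0429, .0072, .981]   # rules (a)-(m), model Cm
composed = [1, .00724, .173, .00719]               # rules (o)-(r), model C4

def log_prob(rule_probs):
    """Log-probability of a derivation: sum of the logs of its rule probabilities."""
    return sum(math.log(p) for p in rule_probs)

print(round(log_prob(minimal), 1))    # -17.7
print(round(log_prob(composed), 1))   # -11.6
```

The composed-rule derivation uses four rules where the minimal one uses thirteen, and most of the gap comes from no longer paying separately for low-probability lexical rules such as (h) and (k).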
5.1 Syntactic translation tables
We evaluate the promise of our SBTM by analyzing
instances of translation tables (t-table). Table 3
shows how a particular form of SVO construc-
tion is transformed into Chinese, which is also an
SVO language. While the t-table for Chinese composed
rules clearly gives good estimates for the
“correct” x0 x1 ordering (p = .9), i.e. subject be-
fore verb, the t-table for minimal rules unreason-
ably gives preference to verb-subject ordering (x1
x0, p = .37), because the most probable transfor-
mation (x0 x1) does not correspond to a minimal
rule. We obtain different results with Arabic, a
VSO language, and our model effectively learns
to move the subject after the verb (p = .59).
lhs1 in Table 4 shows that our model is able
to learn large-scale constituent re-orderings, such
as re-ordering NPs in an NP-of-NP construction,
and to put the modifier first, as is more commonly
the case in Chinese (p = .54). If more syntactic
context is available, as in lhs2, our model
provides much sharper estimates, and appropriately
reverses the order of three constituents with
high probability (p = .68), inserting modifiers first
(possessive markers a132 are needed here for better
syntactic disambiguation).
A limitation of earlier syntax-based systems is
their poor handling of non-constituent phrases.
Table 4 shows that our model can learn rules for
such phrases, e.g., said that (lhs3). While the that
has no direct translation, our model effectively
learns to separate a244 (said) from the relative clause
with a comma, which is common in Chinese.
Anotherpromisingprospectofourmodelseems
to lie in its ability to handle non-contiguous
phrases, a feature that state of the art systems
such as (Och and Ney, 2004) do not incorpo-
rate. The when-NP-are-VP construction of lhs4
presents such a case. Our model identifies that are
needs to be deleted, that when translates into the
phrase a40 ... a246, and that the NP needs to be moved
after the VP in Chinese (p = .66).
6 Empirical evaluation
The task of our decoder is to find the most likely
English tree π that maximizes all models involved
in Equation 2. Since xRs rules can be converted to
context-free productions by increasing the number
of non-terminals, we implemented our decoder as
a standard CKY parser with beam search. Its rule
binarization is described in (Zhang et al., 2006).
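
The chart-plus-beam idea can be sketched on a toy binarized grammar. This is only an illustration under simplifying assumptions: it parses a monolingual string with a hand-written grammar, scores are log-probabilities, and each chart cell keeps its `beam_size` best non-terminals. It does not implement xRs rule conversion or synchronous binarization, and the names `cky_beam` and `prune` and all grammar entries are hypothetical.

```python
import math


def prune(cell, beam_size):
    """Keep only the beam_size highest-scoring entries of a chart cell."""
    ranked = sorted(cell.items(), key=lambda kv: kv[1][0], reverse=True)
    return dict(ranked[:beam_size])


def cky_beam(words, lexical, binary, beam_size=5):
    """CKY over a binarized grammar with per-cell beam pruning.

    lexical: {word: [(nonterminal, logprob), ...]}
    binary:  {(B, C): [(A, logprob), ...]} for rules A -> B C
    Returns {(i, j): {nonterminal: (logprob, backpointer)}}.
    """
    n = len(words)
    chart = {}
    # Width-1 spans come from lexical rules.
    for i, word in enumerate(words):
        cell = {}
        for nt, lp in lexical.get(word, []):
            if nt not in cell or lp > cell[nt][0]:
                cell[nt] = (lp, word)
        chart[i, i + 1] = prune(cell, beam_size)
    # Wider spans combine two adjacent sub-spans with a binary rule.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                for b, (lp_b, _) in chart[i, k].items():
                    for c, (lp_c, _) in chart[k, j].items():
                        for a, lp_rule in binary.get((b, c), []):
                            score = lp_b + lp_c + lp_rule
                            if a not in cell or score > cell[a][0]:
                                cell[a] = (score, (k, b, c))
            chart[i, j] = prune(cell, beam_size)
    return chart


# Hypothetical toy grammar, for illustration only.
lexical = {
    "the": [("DT", math.log(1.0))],
    "dog": [("NN", math.log(0.5)), ("VB", math.log(0.1))],
    "barks": [("VP", math.log(0.25))],
}
binary = {
    ("DT", "NN"): [("NP", math.log(1.0))],
    ("NP", "VP"): [("S", math.log(1.0))],
}
chart = cky_beam(["the", "dog", "barks"], lexical, binary, beam_size=2)
best_score, best_backpointer = chart[0, 3]["S"]
```

A smaller `beam_size` trades search accuracy for speed, since any non-terminal pruned from a cell can no longer take part in larger spans.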
We compare our syntax-based system against
an implementation of the alignment template
(AlTemp) approach to MT (Och and Ney, 2004),
which is widely considered to represent the state
of the art in the field. We registered both systems
in the NIST 2005 evaluation; results are presented
in Table 5. Although our syntax-based system trails
the AlTemp system by 6.4 BLEU points on both
language pairs, we consider its results particularly
promising, since these are the highest scores to date
that we know of using linguistic syntactic
transformations. On the one hand, our AlTemp system
represents quite mature technology, and
incorporates highly tuned model parameters. On
the other hand, our syntax decoder is still work in
progress: only one model was used during search,
i.e., the EM-trained root-normalized SBTM, and
as yet no language model is incorporated in the
search (whereas the search in the AlTemp system
uses two phrase-based translation models and
12 other feature functions). Furthermore, our
decoder does not incorporate any syntax-based
language model, and admittedly our ability to
penalize ill-formed parse trees is still limited.

                      Syntactic   AlTemp
Arabic-to-English     40.2        46.6
Chinese-to-English    24.3        30.7
Table 5: BLEU-4 scores for the 2005 NIST test set.

                      Cm       C3       C4
Chinese-to-English    24.47    27.42    28.1
Table 6: BLEU-4 scores for the 2002 NIST test set, with rules
of increasing sizes.

Finally, we evaluated our system on the NIST-
02 test set with the three different rule sets (see
Table 6). The performance with our largest rule
set represents a 3.63 BLEU point increase (14.8%
relative) compared to using only minimal rules,
which indicates positive prospects for using even
larger rules. While our rule inference algorithm
scales to higher thresholds, one important area of
future work will be the improvement of our de-
coder, conjointly with analyses of the impact in
terms of BLEU of contextually richer rules.
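
The figures quoted in this section can be rechecked from the scores in Tables 5 and 6. The short sketch below only redoes that arithmetic; the variable names are our own, and the rounding precision is an assumption chosen to match the precision reported in the text.

```python
# BLEU-4 scores copied from Table 5: (Syntactic, AlTemp).
table5 = {
    "Arabic-to-English": (40.2, 46.6),
    "Chinese-to-English": (24.3, 30.7),
}
# BLEU-4 scores copied from Table 6 (Chinese-to-English).
table6 = {"Cm": 24.47, "C3": 27.42, "C4": 28.1}

# Gap between AlTemp and the syntax-based system for each language pair.
gaps = {
    pair: round(altemp - syntactic, 1)
    for pair, (syntactic, altemp) in table5.items()
}

# Gain of the largest rule set (C4) over minimal rules (Cm).
absolute_gain = round(table6["C4"] - table6["Cm"], 2)
relative_gain = round(100 * (table6["C4"] - table6["Cm"]) / table6["Cm"], 1)

print(gaps, absolute_gain, relative_gain)
```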
7 Related work
Similarly to (Poutsma, 2000; Wu, 1997; Yamada
and Knight, 2001; Chiang, 2005), the rules dis-
cussed in this paper are equivalent to productions
of synchronous tree substitution grammars. We
believe that our tree-to-string model has several
advantages over tree-to-tree transformations such
as the ones acquired by Poutsma (2000). While
tree-to-tree grammars are richer formalisms that
provide the potential benefit of rules that are lin-
guistically better motivated, modeling the syntax
of both languages comes at an extra cost, and it
is admittedly more helpful to focus our syntac-
tic modeling effort on the target language (e.g.,
English) in cases where it has syntactic resources
(parsers and treebanks) that are considerably more
available than for the source language. Further-
more, we think there is, overall, less benefit in
modeling the syntax of the source language, since
the input sentence is fixed during decoding and is
generally already grammatical.
With the notable exception of Poutsma, most
related works rely on models that are restricted
to synchronous context-free grammars (SCFG).
While the state-of-the-art hierarchical SMT sys-
tem (Chiang, 2005) performs well despite strin-
gent constraints imposed on its context-free gram-
mar, we believe its main advantage lies in its
ability to extract hierarchical rules across phrasal
boundaries. Context-free grammars (such as Penn
Treebank and Chiang’s grammars) make independence
assumptions that are arguably often unreasonable,
but as our work suggests, relaxing
these assumptions by using contextually richer
rules results in translations of increasing quality.
We believe it will be beneficial to account for this
finding in future work in syntax-based SMT and in
efforts to improve upon (Chiang, 2005).
8 Conclusions
In this paper, we developed probability models for
the multi-level transfer rules presented in (Galley
et al., 2004), showed how to acquire larger rules
that crucially condition on more syntactic context,
and how to pack multiple derivations, including
interpretations of unaligned words, into derivation
forests. We presented some theoretical arguments
for not limiting extraction to minimal rules, val-
idated them on concrete examples, and presented
experiments showing that contextually richer rules
provide a 3.63 BLEU point increase over the min-
imal rules of (Galley et al., 2004).
Acknowledgments
We would like to thank anonymous review-
ers for their helpful comments and suggestions.
This work was partially supported under the
GALE program of the Defense Advanced Re-
search Projects Agency, Contract No. HR0011-
06-C-0022.
References
D. Chiang. 2005. A hierarchical phrase-based model for
statistical machine translation. In Proc. of ACL.
H. Fox. 2002. Phrasal cohesion and statistical machine
translation. In Proc. of EMNLP, pages 304–311.
M. Galley, M. Hopkins, K. Knight, and D. Marcu. 2004.
What’s in a translation rule? In Proc. of HLT/NAACL-04.
J. Graehl and K. Knight. 2004. Training tree transducers. In
Proc. of HLT/NAACL-04, pages 105–112.
F. Och and H. Ney. 2004. The alignment template approach
to statistical machine translation. Computational Linguis-
tics, 30(4):417–449.
A. Poutsma. 2000. Data-oriented translation. In Proc. of
COLING, pages 635–641.
D. Wu. 1997. Stochastic inversion transduction grammars
and bilingual parsing of parallel corpora. Computational
Linguistics, 23(3):377–404.
K. Yamada and K. Knight. 2001. A syntax-based statistical
translation model. In Proc. of ACL, pages 523–530.
H. Zhang, L. Huang, D. Gildea, and K. Knight. 2006. Syn-
chronous binarization for machine translation. In Proc. of
HLT/NAACL.
