Statistical Machine Translation by Parsing
I. Dan Melamed
Computer Science Department
New York University
New York, NY, U.S.A.
10003-6806
lastname@cs.nyu.edu
Abstract
In an ordinary syntactic parser, the input is a string,
and the grammar ranges over strings. This paper
explores generalizations of ordinary parsing algo-
rithms that allow the input to consist of string tu-
ples and/or the grammar to range over string tu-
ples. Such algorithms can infer the synchronous
structures hidden in parallel texts. It turns out that
these generalized parsers can do most of the work
required to train and apply a syntax-aware statisti-
cal machine translation system.
1 Introduction
A parser is an algorithm for inferring the structure
of its input, guided by a grammar that dictates what
structures are possible or probable. In an ordinary
parser, the input is a string, and the grammar ranges
over strings. This paper explores generalizations of
ordinary parsing algorithms that allow the input to
consist of string tuples and/or the grammar to range
over string tuples. Such inference algorithms can
perform various kinds of analysis on parallel texts,
also known as multitexts.
Figure 1 shows some of the ways in which ordi-
nary parsing can be generalized. A synchronous
parser is an algorithm that can infer the syntactic
structure of each component text in a multitext and
simultaneously infer the correspondence relation
between these structures.1 When a parser’s input
can have fewer dimensions than the parser’s gram-
mar, we call it a translator. When a parser’s gram-
mar can have fewer dimensions than the parser’s
input, we call it a synchronizer. The corre-
sponding processes are called translation and syn-
chronization. To our knowledge, synchronization
has never been explored as a class of algorithms.
Neither has the relationship between parsing and
word alignment. The relationship between trans-
lation and ordinary parsing was noted a long time
1A suitable set of ordinary parsers can also infer the syntac-
tic structure of each component, but cannot infer the correspon-
dence relation between these structures.
[Figure 1 diagram: axes D = dimensionality of grammar and I = dimensionality of input, with regions for ordinary parsing (D = I = 1), synchronous parsing (D = I), which includes word alignment, translation (D >= I), synchronization (I >= D), and generalized parsing (any D, any I).]
Figure 1: Generalizations of ordinary parsing.
ago (Aho & Ullman, 1969), but here we articu-
late it in more detail: ordinary parsing is a spe-
cial case of synchronous parsing, which is a special
case of translation. This paper offers an informal
guided tour of the generalized parsing algorithms in
Figure 1. It culminates with a recipe for using these
algorithms to train and apply a syntax-aware statis-
tical machine translation (SMT) system.
2 Multitext Grammars and Multitrees
The algorithms in this paper can be adapted for any
synchronous grammar formalism. The vehicle for
the present guided tour shall be multitext grammar
(MTG), which is a generalization of context-free
grammar to the synchronous case (Melamed, 2003).
We shall limit our attention to MTGs in Generalized
Chomsky Normal Form (GCNF) (Melamed et al.,
2004). This normal form allows simpler algorithm
descriptions than the normal forms used by Wu
(1997) and Melamed (2003).
In GCNF, every production is either a terminal
production or a nonterminal production. A nonter-
minal production might look like this:
$$
\begin{pmatrix} X \\ X \\ X \end{pmatrix}
\;\Rightarrow\; \sqcup
\begin{pmatrix} [1,2] \\ [2] \\ [1,2,1] \end{pmatrix}
\begin{pmatrix} A & B \\ () & E \\ D^{(2)} & E \end{pmatrix}
\qquad (1)
$$
There are nonterminals on the left-hand side (LHS)
and in parentheses on the right-hand side (RHS).
Each row of the production describes rewriting in
a different component text of a multitext. In each
row, a role template describes the relative order
and contiguity of the RHS nonterminals. E.g., in
the top row, [1,2] indicates that the first nonter-
minal (A) precedes the second (B). In the bottom
row, [1,2,1] indicates that the first nonterminal both
precedes and follows the second, i.e. D is discon-
tinuous. Discontinuous nonterminals are annotated
with the number of their contiguous segments, as in
$D^{(2)}$. The $\sqcup$ (“join”) operator rearranges the non-
terminals in each component according to their role
template. The nonterminals on the RHS are writ-
ten in columns called links. Links express transla-
tional equivalence. Some nonterminals might have
no translation in some components, indicated by (),
as in the 2nd row. Terminal productions have ex-
actly one “active” component, in which there is ex-
actly one terminal on the RHS. The other compo-
nents are inactive. E.g.,
$$
\begin{pmatrix} () \\ X \end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix} () \\ t \end{pmatrix}
\qquad (2)
$$
The semantics of $\Rightarrow$ are the usual semantics of
rewriting systems, i.e., that the expression on the
LHS can be rewritten as the expression on the RHS.
However, all the nonterminals in the same link must
be rewritten simultaneously. In this manner, MTGs
generate tuples of parse trees that are isomorphic up
to reordering of sibling nodes and deletion. Figure 2
shows two representations of a tree that might be
generated by an MTG in GCNF for the imperative
sentence pair Wash the dishes / Pasudu moy. The
tree exhibits both deletion and inversion in transla-
tion. We shall refer to such multidimensional trees
as multitrees.
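To make this bookkeeping concrete, here is a minimal Python sketch of how a GCNF production might be represented; the class and field names are ours, not part of the MTG formalism.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# A role template such as (1, 2, 1) lists RHS links by position; a
# repeated index marks a discontinuous nonterminal. None marks a
# component in which the production rewrites nothing.
RoleTemplate = Optional[Tuple[int, ...]]

@dataclass(frozen=True)
class Production:
    """One GCNF nonterminal production of a D-component MTG."""
    lhs: Tuple[Optional[str], ...]       # one label per component; None = ()
    templates: Tuple[RoleTemplate, ...]  # one role template per component
    links: Tuple[Tuple[Optional[str], ...], ...]  # RHS columns; None = ()

    def segments(self, component: int, link: int) -> int:
        """Contiguous segments of RHS link `link` in `component`
        (0-based here), e.g. 2 for the D(2) of production (1)."""
        template = self.templates[component]
        return 0 if template is None else template.count(link)

# Production (1): A precedes B in the first component; link 1 has no
# translation in the second component; D is discontinuous ([1,2,1])
# in the third.
p = Production(
    lhs=("X", "X", "X"),
    templates=((1, 2), (2,), (1, 2, 1)),
    links=(("A", None, "D"), ("B", "E", "E")),
)
assert p.segments(component=2, link=1) == 2  # D(2): two segments
```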
The different classes of generalized parsing al-
gorithms in this paper differ only in their gram-
mars and in their logics. They are all compatible
with the same parsing semirings and search strate-
gies. Therefore, we shall describe these algorithms
in terms of their underlying logics and grammars,
abstracting away the semirings and search strate-
gies, in order to elucidate how the different classes
of algorithms are related to each other. Logical de-
scriptions of inference algorithms involve inference
rules:
$\frac{A\ \ B}{C}$ means that $C$ can be inferred from $A$ and $B$. An item that appears in an inference rule
stands for the proposition that the item is in the parse
chart. A production rule that appears in an inference
rule stands for the proposition that the production is
in the grammar.
[Figure 2: Above: a tree generated by a 2-MTG in English and (transliterated) Russian, for the imperative sentence pair Wash the dishes / Pasudu moy. Every internal node is annotated with the linear order of its children, in every component where there are two children. Below: a graphical representation of the same tree; rectangles are 2D constituents. The tree drawings themselves are not recoverable from this extraction; their node labels include S, NP, V, N, and D.]
Such specifications are nondeterministic: they do not indicate the order in which a
parser should attempt inferences. A deterministic
parsing strategy can always be chosen later, to suit
the application. We presume that readers are famil-
iar with declarative descriptions of inference algo-
rithms, as well as with semiring parsing (Goodman,
1999).
3 A Synchronous CKY Parser
Figure 3 shows Logic C. Parser C is any parser based on Logic C. As in Melamed (2003)'s Parser A, Parser C's items consist of a $D$-dimensional label vector $X_1^D$ and a $D$-dimensional d-span vector $\sigma_1^D$.2 The items contain d-spans, rather than ordinary spans, because
2Superscripts and subscripts indicate the range of dimensions of a vector. E.g., $X_1^D$ is a vector spanning dimensions 1 through $D$. See Melamed (2003) for definitions of cardinality, d-span, and the d-span operators.
Parser C needs to know all the boundaries of each item, not just the outermost boundaries. Some (but not all) dimensions of an item can be inactive, denoted $()$, and have an empty d-span ().
The input to Parser C is a tuple of $D$ parallel texts, with lengths $n_1, \ldots, n_D$. The notation $(0, n_d)_1^D$ indicates that the Goal item must span the input from the left of the first word to the right of the last word in each component $d$, $1 \le d \le D$. Thus, the Goal item must be contiguous in all dimensions.
Parser C begins with an empty chart. The only inferences that can fire in this state are those with no antecedent items (though they can have antecedent production rules). In Logic C, $g_T(X \Rightarrow t)$ is the value that the grammar assigns to the terminal production $X \Rightarrow t$. The range of this value depends on the semiring used. A Scan inference can fire for the $i$th word $w_d^i$ in component $d$ for every terminal production in the grammar where $w_d^i$ appears in the $d$th component. Each Scan consequent has exactly one active d-span, and that d-span always has the form $(i-1, i)$ because such items always span one word, so the distance between the item's boundaries is always one.
The Compose inference in Logic C is the same as in Melamed's Parser A, using slightly different notation: In Logic C, the function $g_N(X \Rightarrow \sqcup\,\rho\,(Y\ Z))$ represents the value that the grammar assigns to the nonterminal production $X \Rightarrow \sqcup\,\rho\,(Y\ Z)$. Parser C can compose two items if their labels appear on the RHS of a production rule in the grammar, and if the contiguity and relative order of their intervals is consistent with the role templates of that production rule.
Item Form: $[\,X_1^D;\ \sigma_1^D\,]$    Goal: $[\,S_1^D;\ (0, n_d)_1^D\,]$
Inference Rules
Scan component $d$, $1 \le d \le D$:
$$\frac{g_T\!\left(\begin{pmatrix} ()_1^{d-1} \\ X \\ ()_{d+1}^{D} \end{pmatrix} \Rightarrow \begin{pmatrix} ()_1^{d-1} \\ w_d^i \\ ()_{d+1}^{D} \end{pmatrix}\right)}{\left[\begin{pmatrix} ()_1^{d-1} \\ X \\ ()_{d+1}^{D} \end{pmatrix};\ \begin{pmatrix} ()_1^{d-1} \\ (i-1,\,i) \\ ()_{d+1}^{D} \end{pmatrix}\right]}$$
Compose:
$$\frac{\left[\,Y_1^D;\ \beta_1^D\,\right]\qquad \left[\,Z_1^D;\ \gamma_1^D\,\right]\qquad g_N\!\left(X_1^D \Rightarrow \sqcup\,\rho_1^D\,(Y_1^D\ Z_1^D)\right)}{\left[\,X_1^D;\ \beta_1^D \oplus \gamma_1^D\,\right]}$$
Figure 3: Logic C (“C” for CKY)
These constraints are enforced by the d-span operators, such as the $\oplus$ used above (see Melamed, 2003, for definitions).
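The precise operators are defined in Melamed (2003); the following hedged Python fragment merely illustrates, for a single component, the kind of boundary bookkeeping they perform (all names are ours).

```python
from typing import Tuple

# A d-span lists all boundaries of an item in one component: (2, 3)
# for a one-word item, (0, 1, 2, 3) for an item with two segments.
DSpan = Tuple[int, ...]

def scan_dspan(i: int) -> DSpan:
    """D-span of a Scan consequent for the i-th word: (i-1, i)."""
    return (i - 1, i)

def compose_dspans(a: DSpan, b: DSpan) -> DSpan:
    """Merge two d-spans, fusing boundaries where segments abut;
    a result with 2k entries describes an item with k segments."""
    merged: list = []
    for x in sorted(a + b):
        if merged and merged[-1] == x:
            merged.pop()        # two segments meet at x: fuse them
        else:
            merged.append(x)
    return tuple(merged)

# "Wash" spans (0,1) and "the dishes" spans (1,3): composing them
# yields the contiguous span (0,3); a gap would leave 4 boundaries.
assert compose_dspans(scan_dspan(1), (1, 3)) == (0, 3)
assert compose_dspans((0, 1), (2, 3)) == (0, 1, 2, 3)
```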
Parser C is conceptually simpler than the syn-
chronous parsers of Wu (1997), Alshawi et al.
(2000), and Melamed (2003), because it uses only
one kind of item, and it never composes terminals.
The inference rules of Logic C are the multidimen-
sional generalizations of inference rules with the
same names in ordinary CKY parsers. For exam-
ple, given a suitable grammar and the input (imper-
ative) sentence pair Wash the dishes / Pasudu moy,
Parser C might make the 9 inferences in Figure 4 to
infer the multitree in Figure 2. Note that there is one
inference per internal node of the multitree.
Goodman (1999) shows how a parsing logic can
be combined with various semirings to compute dif-
ferent kinds of information about the input. De-
pending on the chosen semiring, a parsing logic can
compute the single most probable derivation and/or
its probability, the $k$ most probable derivations
and/or their total probability, all possible derivations
and/or their total probability, the number of possi-
ble derivations, etc. All the parsing semirings cat-
alogued by Goodman apply the same way to syn-
chronous parsing, and to all the other classes of al-
gorithms discussed in this paper.
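Concretely, a logic touches derivation values only through two operations, so the semiring is literally a plug-in parameter. A minimal sketch, with names of our choosing, of three semirings from Goodman's catalogue:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    """What a parsing logic needs from a semiring: `plus` aggregates
    alternative derivations of the same item, `times` combines a
    production's value with the values of the antecedent items."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any  # value of an underivable item
    one: Any   # identity for `times`

# Recognition: does any derivation of the Goal item exist?
BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)

# Inside probabilities: total probability of all derivations.
INSIDE = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

# Viterbi: probability of the single most probable derivation.
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
```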
The class of synchronous parsers includes some
algorithms for word alignment. A translation lexi-
con (weighted or not) can be viewed as a degenerate
MTG (not in GCNF) where every production has a
link of terminals on the RHS. Under such an MTG,
the logic of word alignment is the one in Melamed
(2003)’s Parser A, but without Compose inferences.
The only other difference is that, instead of a single
item, the Goal of word alignment is any set of items
that covers all dimensions of the input. This logic
can be used with the expectation semiring (Eisner,
2002) to find the maximum likelihood estimates of
the parameters of a word-to-word translation model.
An important application of Parser C is parameter
estimation for probabilistic MTGs (PMTGs). Eis-
ner (2002) has claimed that parsing under an expec-
tation semiring is equivalent to the Inside-Outside
algorithm for PCFGs. If so, then there is a straight-
forward generalization for PMTGs. Parameter es-
timation is beyond the scope of this paper, however.
The next section assumes that we have an MTG,
probabilistic or not, as required by the semiring.
4 Translation
A $D$-MTG can guide a synchronous parser to infer the hidden structure of a $D$-component multitext. Now suppose that we have a $D$-MTG and an input multitext with only $I$ components, $I < D$.
[Figure 4: Possible sequence of inferences of Parser C on input Wash the dishes / Pasudu moy. The nine inferences, one per internal node of the multitree in Figure 2, are not recoverable from this extraction.]
When some of the component texts are missing, we can ask the parser to infer a $D$-dimensional multitree that includes the missing components. The resulting multitree will cover the $I$ input components/dimensions among its $D$ dimensions. It will also express the $D - I$ output components/dimensions, along with their syntactic structures.
Item Form: $[\,X_1^D;\ \sigma_1^I\,]$    Goal: $[\,S_1^D;\ (0, n_d)_1^I\,]$
Inference Rules
Scan component $d$, $1 \le d \le I$:
$$\frac{g_T\!\left(\begin{pmatrix} ()_1^{d-1} \\ X \\ ()_{d+1}^{D} \end{pmatrix} \Rightarrow \begin{pmatrix} ()_1^{d-1} \\ w_d^i \\ ()_{d+1}^{D} \end{pmatrix}\right)}{\left[\begin{pmatrix} ()_1^{d-1} \\ X \\ ()_{d+1}^{D} \end{pmatrix};\ \begin{pmatrix} ()_1^{d-1} \\ (i-1,\,i) \\ ()_{d+1}^{I} \end{pmatrix}\right]}$$
Load component $d$, $I < d \le D$:
$$\frac{g_T\!\left(\begin{pmatrix} ()_1^{d-1} \\ X \\ ()_{d+1}^{D} \end{pmatrix} \Rightarrow \begin{pmatrix} ()_1^{d-1} \\ t \\ ()_{d+1}^{D} \end{pmatrix}\right)}{\left[\begin{pmatrix} ()_1^{d-1} \\ X \\ ()_{d+1}^{D} \end{pmatrix};\ ()_1^{I}\right]}$$
Compose:
$$\frac{\left[\,Y_1^D;\ \beta_1^I\,\right]\qquad \left[\,Z_1^D;\ \gamma_1^I\,\right]\qquad g_N\!\left(X_1^D \Rightarrow \sqcup\,\rho_1^D\,(Y_1^D\ Z_1^D)\right)}{\left[\,X_1^D;\ \beta_1^I \oplus \gamma_1^I\,\right]}$$
Figure 5: Logic CT (“T” for Translation)
Figure 5 shows Logic CT, which is a generalization of Logic C. Translator CT is any parser based on Logic CT. The items of Translator CT have a $D$-dimensional label vector, as usual. However, their d-span vectors are only $I$-dimensional, because it is not necessary to constrain absolute word positions in the output dimensions. Instead, we need only constrain the cardinality of the output nonterminals, which is accomplished by the role templates $\rho_{I+1}^D$ in the $g_N$ term. Translator CT scans only the input components. Terminal productions with active output components are simply loaded from the grammar, and their LHSs are added to the chart without d-span information. Composition proceeds as before, except that there are no constraints on the role templates in the output dimensions – the role templates in $\rho_{I+1}^D$ are free variables.
In summary, Logic CT differs from Logic C as follows:
• Items store no position information (d-spans) for the output components.
• For the output components, the Scan inferences are replaced by Load inferences, which are not constrained by the input (see the sketch below).
• The Compose inference does not constrain the d-spans of the output components. (Though it still constrains their cardinality.)
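The asymmetry is small enough to show in a few lines. Below is a hedged sketch of the two terminal-level inferences; the chart, grammar, and item representations are simplified stand-ins of our own, with d-spans kept only for the $I$ input components.

```python
def scan(chart, grammar, words, d, I):
    """Scan input component d (1 <= d <= I): a terminal production may
    fire only where its terminal matches a word of the input text."""
    for i, w in enumerate(words[d], start=1):
        for prod in grammar.terminal_productions(active_component=d):
            if prod.terminal == w:
                # d-span (i-1, i) in component d, empty elsewhere
                dspans = tuple((i - 1, i) if k == d else ()
                               for k in range(1, I + 1))
                chart.add((prod.lhs, dspans))

def load(chart, grammar, d, I):
    """Load output component d (I < d <= D): every terminal production
    active in d fires, unconstrained by input, with no d-span info."""
    no_dspans = tuple(() for _ in range(I))
    for prod in grammar.terminal_productions(active_component=d):
        chart.add((prod.lhs, no_dspans))
```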
We have constructed a translator from a synchronous parser merely by relaxing some constraints on the output dimensions. Logic C is just Logic CT for the special case where $I = D$. The relationship between the two classes of algorithms is easier to see from their declarative logics than it would be from their procedural pseudocode or equations.
Like Parser C, Translator CT can Compose items
that have no dimensions in common. If one of the
items is active only in the input dimension(s), and
the other only in the output dimension(s), then the
inference is, de facto, a translation. The possible
translations are determined by consulting the gram-
mar. Thus, in addition to its usual function of eval-
uating syntactic structures, the grammar simultane-
ously functions as a translation model.
Logic CT can be coupled with any parsing semiring. For example, under a boolean semiring, this logic will succeed on an $I$-dimensional input if and only if it can infer a $D$-dimensional multitree whose root is the goal item. Such a tree would contain a $(D - I)$-dimensional translation of the input. Thus, under a boolean semiring, Translator CT can determine whether a translation of the input exists.
Under an inside-probability semiring, Translator CT can compute the total probability of all multitrees containing the input and its translations in the $D - I$ output components. All these derivation trees, along with their probabilities, can be efficiently represented as a packed parse forest, rooted at the goal item. Unfortunately, finding the most probable output string still requires summing probabilities over an exponential number of trees. This problem was shown to be NP-hard in the one-dimensional case (Sima'an, 1996). We have no reason to believe that it is any easier when $D > 1$.
The Viterbi-derivation semiring would be the one most often used with Translator CT in practice. Given a $D$-PMTG, Translator CT can use this semiring to find the single most probable $D$-dimensional multitree that covers the $I$-dimensional input. The multitree inferred by the translator will have the words of both the input and the output components in its leaves. For example, given a suitable grammar and the input Pasudu moy, Translator CT could infer the multitree in Figure 2. The set of inferences would be exactly the same as those listed in Figure 4, except that the items would have no d-spans in the English component.
In practice, we usually want the output as a string tuple, rather than as a multitree. Under the various derivation semirings (Goodman, 1999), Translator CT can store the output role templates $\rho_{I+1}^D$ in each internal node of the tree. The intended ordering of the terminals in each output dimension can then be assembled from these templates by a linear-time linearization post-process that traverses the finished multitree in postorder.
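A hedged sketch of that post-process, assuming each node stores its role template for the output component and its children in RHS order, and assuming for simplicity that every node is contiguous in that component (the accessors are ours):

```python
def linearize(node, d):
    """Return the words of `node` in output component d, assembled
    bottom-up by expanding each node's stored role template."""
    if node.is_terminal:
        w = node.word(d)             # None if inactive in component d
        return [w] if w is not None else []
    template = node.template(d)      # e.g. (1, 2); None if inactive
    if template is None:
        return []
    pieces = [linearize(child, d) for child in node.children]
    # Splice the children's yields in template order. (Discontinuous
    # children would need their yields pre-split into segments; this
    # sketch assumes contiguous nodes, so each index appears once.)
    return [w for index in template for w in pieces[index - 1]]
```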
To the best of our knowledge, Logic CT is the first
published translation logic to be compatible with all
of the semirings catalogued by Goodman (1999),
among others. It is also the first to simultaneously
accommodate multiple input components and mul-
tiple output components. When a source docu-
ment is available in multiple languages, a translator
can benefit from the disambiguating information in
each. Translator CT can take advantage of such in-
formation without making the strong independence
assumptions of Och & Ney (2001). When output is
desired in multiple languages, Translator CT offers
all the putative benefits of the interlingual approach
to MT, including greater efficiency and greater con-
sistency across output components. Indeed, the lan-
guage of multitrees can be viewed as an interlingua.
5 Synchronization
We have explored inference of $I$-dimensional multitrees under a $D$-dimensional grammar, where $D \ge I$. Now we generalize along the other axis of Figure 1. Multitext synchronization is most often used to infer $I$-dimensional multitrees without the benefit of an $I$-dimensional grammar. One application is inducing a parser in one language from a parser in another (Lü et al., 2002). The application that is most relevant to this paper is bootstrapping an $I$-dimensional grammar. In theory, it is possible to
induce a PMTG from multitext in an unsupervised
manner. A more reliable way is to start from a
corpus of multitrees — a multitreebank.3
We are not aware of any multitreebanks at this
time. The most straightforward way to create one is
to parse some multitext using a synchronous parser,
such as Parser C. However, if the goal is to boot-
strap an $I$-PMTG, then there is no $I$-PMTG that can evaluate the $g$ terms in the parser's logic. Our solution is to orchestrate lower-dimensional knowledge sources to evaluate the $g$ terms. Then, we can use
the same parsing logic to synchronize multitext into
a multitreebank.
To illustrate, we describe a relatively simple syn-
chronizer, using the Viterbi-derivation semiring.4
Under this semiring, a synchronizer computes the
single most probable multitree for a given multitext.
3In contrast, a parallel treebank might contain no informa-
tion about translational equivalence.
4The inside-probability semiring would be required for
maximum-likelihood synchronization.
[Figure 6: Synchronization. Only one synchronous dependency structure (dashed arrows) is compatible with the monolingual structure (solid arrows) and word alignment (shaded cells). The example aligns I fed the cat with ya kota kormil.]
If we have no suitable PMTG, then we can use other
criteria to search for trees that have high probability.
We shall consider the common synchronization sce-
nario where a lexicalized monolingual grammar is
available for at least one component.5 Also, given
a tokenized set of $I$-tuples of parallel sentences, it is always possible to estimate a word-to-word translation model (e.g., Och & Ney, 2003).6
A word-to-word translation model and a lexical-
ized monolingual grammar are sufficient to drive a
synchronizer. For example, in Figure 6 a mono-
lingual grammar has allowed only one dependency
structure on the English side, and a word-to-word
translation model has allowed only one word align-
ment. The syntactic structures of all dimensions
of a multitree are isomorphic up to reordering of
sibling nodes and deletion. So, given a fixed cor-
respondence between the tree leaves (i.e. words)
across components, choosing the optimal structure
for one component is tantamount to choosing the
optimal synchronous structure for all components.7
Ignoring the nonterminal labels, only one depen-
dency structure is compatible with these constraints
– the one indicated by dashed arrows. Bootstrap-
ping a PMTG from a lower-dimensional PMTG and
a word-to-word translation model is similar in spirit
to the way that regular grammars can help to es-
timate CFGs (Lari & Young, 1990), and the way
that simple translation models can help to bootstrap
more sophisticated ones (Brown et al., 1993).
5Such a grammar can be induced from a treebank, for exam-
ple. We are currently aware of treebanks for English, Spanish,
German, Chinese, Czech, Arabic, and Korean.
6Although most of the literature discusses word transla-
tion models between only two languages, it is possible to
combine several 2D models into a higher-dimensional model
(Mann & Yarowsky, 2001).
7Except where the unstructured components have words
that are linked to nothing.
We need only redefine the $g$ terms in a way that does not rely on an $I$-PMTG. Without loss of generality, we shall assume a $D$-PMTG that ranges over the first $D$ components, where $D < I$. We shall then refer to the $D$ structured components and the $I - D$ unstructured components.
We begin with $g_T$. For the structured components $d$, $1 \le d \le D$, we retain the grammar-based definition: $g_T(X_d \Rightarrow t_d) \equiv \Pr(t_d \mid X_d)$,8 where the latter probability can be looked up in our $D$-PMTG. For the unstructured components,
there are no useful nonterminal labels. Therefore,
we assume that the unstructured components use only one (dummy) nonterminal label, so that $g_T(X_d \Rightarrow t_d) \equiv 1$ if $X_d$ is the dummy label and undefined otherwise, for $D < d \le I$.
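A hedged sketch of this two-case definition, where DUMMY and the PMTG accessor are stand-ins of our own:

```python
DUMMY = "*"  # the single nonterminal label of the unstructured components

def g_T(label, terminal, d, D, pmtg):
    """Value of the terminal production label => terminal, active in
    component d: a D-PMTG lookup for structured components (d <= D),
    probability one under the dummy label for unstructured ones."""
    if d <= D:
        return pmtg.prob_terminal(label, terminal, d)  # Pr(t_d | X_d)
    if label == DUMMY:
        return 1.0
    raise ValueError("unstructured components use only the dummy label")
```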
Our treatment of nonterminal productions begins by applying the chain rule9 (here $h_1^I$ is the vector of lexical heads, which the heir link, by convention the second nonterminal link on the RHS, shares with the LHS, and $h'^{I}_1$ is the head vector of the other link):
$$g_N\!\left(X_1^I(h_1^I) \Rightarrow \sqcup\,\rho_1^I\,\big(Y_1^I(h'^{I}_1)\ \ Z_1^I(h_1^I)\big)\right) \equiv \Pr\!\left(\rho_1^I,\, h'^{I}_1,\, Y_1^I,\, Z_1^I \mid X_1^I,\, h_1^I\right) \qquad (3)$$
$$\begin{aligned}
\equiv\;& \Pr\!\left(\rho_1^D,\, h'^{D}_1,\, Y_1^D,\, Z_1^D \mid X_1^I,\, h_1^I\right)\\
\cdot\;& \Pr\!\left(Y_{D+1}^I,\, Z_{D+1}^I \mid \rho_1^D,\, h'^{D}_1,\, Y_1^D,\, Z_1^D,\, X_1^I,\, h_1^I\right)\\
\cdot\;& \Pr\!\left(h'^{I}_{D+1} \mid \rho_1^D,\, h'^{D}_1,\, Y_1^I,\, Z_1^I,\, X_1^I,\, h_1^I\right)\\
\cdot\;& \Pr\!\left(\rho_{D+1}^I \mid \rho_1^D,\, h'^{I}_1,\, Y_1^I,\, Z_1^I,\, X_1^I,\, h_1^I\right)
\end{aligned} \qquad (4)$$
and continues by making independence assumptions. The first assumption is that the structured components of the production's RHS are conditionally independent of the unstructured components of its LHS:
$$\Pr\!\left(\rho_1^D,\, h'^{D}_1,\, Y_1^D,\, Z_1^D \mid X_1^I,\, h_1^I\right) \equiv \Pr\!\left(\rho_1^D,\, h'^{D}_1,\, Y_1^D,\, Z_1^D \mid X_1^D,\, h_1^D\right) \qquad (5)$$
The above probability can be looked up in the $D$-PMTG. Second, since we have no useful nonterminals in the unstructured components, we let
$$\Pr\!\left(Y_{D+1}^I,\, Z_{D+1}^I \mid \rho_1^D,\, h'^{D}_1,\, Y_1^D,\, Z_1^D,\, X_1^I,\, h_1^I\right) \equiv 1 \qquad (6)$$
if $Y_{D+1}^I$ and $Z_{D+1}^I$ consist entirely of the dummy label, and 0 otherwise. Third,
we assume that the word-to-word translation probabilities are independent of anything else:
$$\Pr\!\left(h'^{I}_{D+1} \mid \rho_1^D,\, h'^{D}_1,\, Y_1^I,\, Z_1^I,\, X_1^I,\, h_1^I\right) \equiv \Pr\!\left(h'^{I}_{D+1} \mid h'^{D}_1\right) \qquad (7)$$
8We have ignored lexical heads so far, but we need them for
this synchronizer.
9The procedure is analogous when the heir is the first non-
terminal link on the RHS, rather than the second.
These probabilities can be obtained from our word-to-word translation model, which would typically be estimated under exactly such an independence assumption. Finally, we assume that the output role templates are independent of each other and uniformly distributed, up to some maximum cardinality $m$. Let $r(m)$ be the number of unique role templates of cardinality $m$ or less. Then
$$\Pr\!\left(\rho_{D+1}^I \mid \rho_1^D,\, h'^{I}_1,\, Y_1^I,\, Z_1^I,\, X_1^I,\, h_1^I\right) \equiv \prod_{d=D+1}^{I} \frac{1}{r(m)} = \frac{1}{r(m)^{\,I-D}} \qquad (8)$$
Under Assumptions 5–8,
$$g_N\!\left(X_1^I(h_1^I) \Rightarrow \sqcup\,\rho_1^I\,\big(Y_1^I(h'^{I}_1)\ \ Z_1^I(h_1^I)\big)\right) \equiv \frac{\Pr\!\left(\rho_1^D,\, h'^{D}_1,\, Y_1^D,\, Z_1^D \mid X_1^D,\, h_1^D\right) \cdot \Pr\!\left(h'^{I}_{D+1} \mid h'^{D}_1\right)}{r(m)^{\,I-D}} \qquad (9)$$
if $Y_{D+1}^I$ and $Z_{D+1}^I$ consist entirely of the dummy label, and 0 otherwise. We can use these definitions of the grammar terms in the inference rules of Logic C to synchronize multitexts into multitreebanks.
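Assembled into code, Equation 9 becomes a short scoring routine. The sketch below is hedged: every accessor (child_label, prob_nonterminal, ttable.prob, and so on) is a stand-in of ours, not an interface from the paper.

```python
DUMMY = "*"  # as in the g_T sketch above

def g_N(prod, D, I, pmtg, ttable, r_m):
    """Score a nonterminal production for synchronization, per (9):
    a D-PMTG factor for the structured components, a word-to-word
    factor for the unstructured heads of the non-heir child, and a
    uniform 1/r(m) for each of the I-D free output role templates."""
    # Assumption 6: unstructured child labels must all be the dummy.
    for d in range(D + 1, I + 1):
        if prod.child_label(1, d) != DUMMY or prod.child_label(2, d) != DUMMY:
            return 0.0
    structured = pmtg.prob_nonterminal(prod.restrict_to(1, D))   # (5)
    lexical = ttable.prob(prod.dep_heads(D + 1, I),              # (7)
                          given=prod.dep_heads(1, D))
    return structured * lexical / (r_m ** (I - D))               # (8)
```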
More sophisticated synchronization methods are
certainly possible. For example, we could project
a part-of-speech tagger (Yarowsky & Ngai, 2001)
to improve our estimates in Equation 6. Yet, de-
spite their relative simplicity, the above methods
for estimating production rule probabilities use all
of the available information in a consistent man-
ner, without double-counting. This kind of synchro-
nizer stands in contrast to more ad-hoc approaches (e.g., Matsumoto, 1993; Meyers et al., 1996; Wu & Wong, 1998; Hwa et al., 2002). Some of these previous works
fix the word alignments first, and then infer com-
patible parse structures. Others do the opposite. In-
formation about syntactic structure can be inferred
more accurately given information about transla-
tional equivalence, and vice versa. Commitment to
either kind of information without consideration of
the other increases the potential for compounded er-
rors.
6 Multitree-based Statistical MT
Multitree-based statistical machine translation
(MTSMT) is an architecture for SMT that revolves
around multitrees. Figure 7 shows how to build and
use a rudimentary MTSMT system, starting from
some multitext and one or more monolingual tree-
banks. The recipe follows:
T1. Induce a word-to-word translation model.
T2. Induce PCFGs from the relative frequencies of
productions in the monolingual treebanks.
T3. Synchronize some multitext, e.g. using the ap-
proximations in Section 5.
T4. Induce an initial PMTG from the relative fre-
quencies of productions in the multitreebank.
T5. Re-estimate the PMTG parameters, using a
synchronous parser with the expectation semir-
ing.
A1. Use the PMTG to infer the most probable mul-
titree covering new input text.
A2. Linearize the output dimensions of the multi-
tree.
Steps T2, T4 and A2 are trivial. Steps T1, T3, T5,
and A1 are instances of the generalized parsers de-
scribed in this paper.
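As a data-flow sketch in Python, with one hypothetical function per box in Figure 7:

```python
def train_and_apply(multitext, treebanks, input_text):
    """One pass through Figure 7; every function below is a stand-in
    for a generalized parser or a trivial counting/reordering step."""
    ttable = induce_word_translation_model(multitext)           # T1
    pcfgs = [induce_pcfg(tb) for tb in treebanks]               # T2
    multitreebank = synchronize(multitext, pcfgs, ttable)       # T3
    pmtg = relative_frequency_pmtg(multitreebank)               # T4
    pmtg = reestimate(pmtg, multitext)    # T5: expectation semiring
    multitree = translate(pmtg, input_text)   # A1: Viterbi semiring
    return linearize_outputs(multitree)                         # A2
```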
Figure 7 is only an architecture. Computational
complexity and generalization error stand in the
way of its practical implementation. Nevertheless,
it is satisfying to note that all the non-trivial algo-
rithms in Figure 7 are special cases of Translator CT.
It is therefore possible to implement an MTSMT
system using just one inference algorithm, param-
eterized by a grammar, a semiring, and a search
strategy. An advantage of building an MT system in
this manner is that improvements invented for ordi-
nary parsing algorithms can often be applied to all
the main components of the system. For example,
Melamed (2003) showed how to reduce the computational complexity of a synchronous parser by a factor of $O(n^D)$, just by changing the logic. The same optimization can be applied to the inference algorithms
in this paper. With proper software design, such op-
timizations need never be implemented more than
once. For simplicity, the algorithms in this paper
are based on CKY logic. However, the architecture
in Figure 7 can also be implemented using general-
izations of more sophisticated parsing logics, such
as those inherent in Earley or Head-Driven parsers.
7 Conclusion
This paper has presented generalizations of ordinary
parsing that emerge when the grammar and/or the
input can be multidimensional. Along the way, it
has elucidated the relationships between ordinary
parsers and other classes of algorithms, some pre-
viously known and some not. It turns out that, given
some multitext and a monolingual treebank, a rudi-
mentary multitree-based statistical machine transla-
tion system can be built and applied using only gen-
eralized parsers and some trivial glue.
[Figure 7: Data-flow diagram for a rudimentary MTSMT system based on generalizations of parsing. Training: multitext and monolingual treebank(s) feed word-to-word translation model induction (T1) and relative-frequency PCFG induction (T2); synchronization (T3) yields a multitreebank; relative-frequency computation (T4) yields an initial PMTG; parameter estimation via synchronous parsing (T5) refines it. Application: input multitext is translated (A1) into a multitree, which is linearized (A2) into output multitext.]
There are three research benefits of using generalized parsers to build MT systems. First, we can
take advantage of past and future research on mak-
ing parsers more accurate and more efficient. There-
fore, second, we can concentrate our efforts on
better models, without worrying about MT-specific
search algorithms. Third, more generally and most
importantly, this approach encourages MT research
to be less specialized and more transparently related
to the rest of computational linguistics.
Acknowledgments
Thanks to Joseph Turian, Wei Wang, Ben Wellington, and the
anonymous reviewers for valuable feedback. This research was
supported by an NSF CAREER Award, the DARPA TIDES
program, and an equipment gift from Sun Microsystems.
References
A. Aho & J. Ullman (1969) “Syntax Directed Translations and the Pushdown Assembler,” Journal of Computer and System Sciences 3:37-56.
H. Alshawi, S. Bangalore, & S. Douglas (2000) “Learning De-
pendency Translation Models as Collections of Finite State
Head Transducers,” Computational Linguistics 26(1):45-60.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, & R. L. Mer-
cer (1993) “The Mathematics of Statistical Machine Trans-
lation: Parameter Estimation,” Computational Linguistics
19(2):263–312.
J. Goodman (1999) “Semiring Parsing,” Computational Linguistics 25(4):573-605.
J. Eisner (2002) “Parameter Estimation for Probabilistic Finite-State Transducers,” Proceedings of the ACL.
R. Hwa, P. Resnik, A. Weinberg, & O. Kolak (2002) “Evaluating Translational Correspondence using Annotation Projection,” Proceedings of the ACL.
K. Lari & S. Young (1990) “The Estimation of Stochas-
tic Context-Free Grammars using the Inside-Outside Algo-
rithm,” Computer Speech and Language Processing 4:35–
56.
Y. Lü, S. Li, T. Zhao, & M. Yang (2002) “Learning Chinese Bracketing Knowledge Based on a Bilingual Language Model,” Proceedings of COLING.
G. S. Mann & D. Yarowsky (2001) “Multipath Translation
Lexicon Induction via Bridge Languages,” Proceedings of
HLT/NAACL.
Y. Matsumoto (1993) “Structural Matching of Parallel Texts,”
Proceedings of the ACL.
I. D. Melamed (2003) “Multitext Grammars and Synchronous
Parsers,” Proceedings of HLT/NAACL.
I. D. Melamed, G. Satta, & B. Wellington (2004) “General-
ized Multitext Grammars,” Proceedings of the ACL (this
volume).
A. Meyers, R. Yangarber, & R. Grishman (1996) “Alignment of
Shared Forests for Bilingual Corpora,” Proceedings of COL-
ING.
F. Och & H. Ney (2001) “Statistical Multi-Source Translation,”
Proceedings of MT Summit VIII.
F. Och & H. Ney (2003) “A Systematic Comparison of Various
Statistical Alignment Models,” Computational Linguistics
29(1):19-51.
K. Sima’an (1996) “Computational Complexity of Probabilis-
tic Disambiguation by means of Tree-Grammars,” Proceed-
ings of COLING.
D. Wu (1996) “A polynomial-time algorithm for statistical ma-
chine translation,” Proceedings of the ACL.
D. Wu (1997) “Stochastic inversion transduction grammars and
bilingual parsing of parallel corpora,” Computational Lin-
guistics 23(3):377-404.
D. Wu & H. Wong (1998) “Machine translation with a stochas-
tic grammatical channel,” Proceedings of the ACL.
K. Yamada & K. Knight (2002) “A Decoder for Syntax-based
Statistical MT,” Proceedings of the ACL.
D. Yarowsky & G. Ngai (2001) “Inducing Multilingual POS
Taggers and NP Bracketers via Robust Projection Across
Aligned Corpora,” Proceedings of the NAACL.
