Generalized Multitext Grammars
I. Dan Melamed
Computer Science Department
New York University
715 Broadway, 7th Floor
New York, NY, 10003, USA
{lastname}@cs.nyu.edu
Giorgio Satta
Dept. of Information Eng’g
University of Padua
via Gradenigo 6/A
I-35131 Padova, Italy
{lastname}@dei.unipd.it
Benjamin Wellington
Computer Science Department
New York University
715 Broadway, 7th Floor
New York, NY, 10003, USA
{lastname}@cs.nyu.edu
Abstract
Generalized Multitext Grammar (GMTG) is a syn-
chronous grammar formalism that is weakly equiv-
alent to Linear Context-Free Rewriting Systems
(LCFRS), but retains much of the notational and in-
tuitive simplicity of Context-Free Grammar (CFG).
GMTG allows both synchronous and independent
rewriting. Such flexibility facilitates more perspic-
uous modeling of parallel text than what is possible
with other synchronous formalisms. This paper in-
vestigates the generative capacity of GMTG, proves
that each component grammar of a GMTG retains
its generative power, and proposes a generalization
of Chomsky Normal Form, which is necessary for
synchronous CKY-style parsing.
1 Introduction
Synchronous grammars have been proposed for
the formal description of parallel texts representing
translations of the same document. As shown by
Melamed (2003), a plausible model of parallel text
must be able to express discontinuous constituents.
Since linguistic expressions can vanish in transla-
tion, a good model must be able to express inde-
pendent (in addition to synchronous) rewriting. In-
version Transduction Grammar (ITG) (Wu, 1997)
and Syntax-Directed Translation Schema (SDTS)
(Aho and Ullman, 1969) lack both of these prop-
erties. Synchronous Tree Adjoining Grammar
(STAG) (Shieber, 1994) lacks the latter and allows
only limited discontinuities in each tree.
Generalized Multitext Grammar (GMTG) offers
a way to synchronize Mildly Context-Sensitive
Grammar (MCSG), while satisfying both of the
above criteria. The move to MCSG is motivated
by our desire to more perspicuously account for
certain syntactic phenomena that cannot be easily
captured by context-free grammars, such as clitic
climbing, extraposition, and other types of long-
distance movement (Becker et al., 1991). On the
other hand, MCSG still observes some restrictions
that make the set of languages it generates less ex-
pensive to analyze than the languages generated by
(properly) context-sensitive formalisms.
More technically, our proposal starts from Mul-
titext Grammar (MTG), a formalism for synchro-
nizing context-free grammars recently proposed by
Melamed (2003). In MTG, synchronous rewriting
is implemented by means of an indexing relation
that is maintained over occurrences of nonterminals
in a sentential form, using essentially the same ma-
chinery as SDTS. Unlike SDTS, MTG can extend
the dimensionality of the translation relation be-
yond two, and it can implement independent rewrit-
ing by means of partial deletion of syntactic struc-
tures. Our proposal generalizes MTG by moving
from component grammars that generate context-
free languages to component grammars whose gen-
erative power is equivalent to Linear Context-Free
Rewriting Systems (LCFRS), a formalism for de-
scribing a class of MCSGs. The generalization is
achieved by allowing context-free productions to
rewrite tuples of strings, rather than single strings.
Thus, we retain the intuitive top-down definition of
synchronous derivation that is found in SDTS and MTG
but not found in LCFRS, while extending the gen-
erative power to linear context-free rewriting lan-
guages. In this respect, GMTG has also been in-
spired by the class of Local Unordered Scattered
Context Grammars (Rambow and Satta, 1999). A
syntactically very different synchronous formalism
involving LCFRS has been presented by Bertsch
and Nederhof (2001).
This paper begins with an informal description of
GMTG. It continues with an investigation of this
formalism’s generative capacity. Next, we prove
that in GMTG each component grammar retains its
generative power, a requirement for synchronous
formalisms that Rambow and Satta (1996) called
the “weak language preservation property.” Lastly,
we propose a synchronous generalization of Chom-
sky Normal Form, which lays the groundwork for
synchronous parsing under GMTG using a CKY-
style algorithm (Younger, 1967; Melamed, 2004).
2 Informal Description and Comparisons
GMTG is a generalization of MTG, which is itself
a generalization of CFG to the synchronous case.
Here we present MTG in a new notation that shows
the relation to CFG more clearly. For example, the
following MTG productions can generate the multi-
text [(I fed the cat), (ya kota kormil)]:1
$[(\text{S}), (\text{S})] \rightarrow [(\text{PN}^{1}\ \text{VP}^{2}), (\text{PN}^{1}\ \text{VP}^{2})]$ (1)
$[(\text{PN}), (\text{PN})] \rightarrow [(\text{I}), (\text{ya})]$ (2)
$[(\text{VP}), (\text{VP})] \rightarrow [(\text{V}^{1}\ \text{NP}^{2}), (\text{NP}^{2}\ \text{V}^{1})]$ (3)
$[(\text{V}), (\text{V})] \rightarrow [(\text{fed}), (\text{kormil})]$ (4)
$[(\text{NP}), (\text{NP})] \rightarrow [(\text{D}^{1}\ \text{N}^{2}), (\text{N}^{2})]$ (5)
$[(\text{D}), ()] \rightarrow [(\text{the}), ()]$ (6)
$[(\text{N}), (\text{N})] \rightarrow [(\text{cat}), (\text{kota})]$ (7)
Each production in this example has two com-
ponents, the first modeling English and the sec-
ond (transliterated) Russian. Nonterminals with the
same index must be rewritten together (synchronous
rewriting). One strength of MTG, and thus also
GMTG, is shown in Productions (5) and (6). There
is a determiner in English, but not in Russian, so
Production (5) does not have the nonterminal D in
the Russian component and (6) applies only to the
English component (independent rewriting). Formalisms
that do not allow independent rewriting require a
corresponding D to appear in the second component on the
right-hand side (RHS) of Production (5), and this D would
eventually generate the
empty string. This approach has the disadvantage
that it introduces spurious ambiguity about the po-
sition of the “empty” nonterminal with respect to
the other nonterminals in its component. Spurious
ambiguity leads to wasted effort during parsing.
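The way Productions (1) to (7) jointly generate the example multitext can be traced with a minimal Python sketch. The encoding is an assumption made for illustration: terminals are plain strings, an indexed nonterminal is a `(name, index)` pair, `None` stands for the empty tuple (), and the helper `expand` is hypothetical. The sketch relies on this toy grammar having exactly one production per link, so it is not a general derivation procedure.

```python
# Each key is a link (one nonterminal name per component, None if inactive).
# Each value gives the right-hand side of every component.
PRODUCTIONS = {
    ("S", "S"): ([("PN", 1), ("VP", 2)], [("PN", 1), ("VP", 2)]),    # (1)
    ("PN", "PN"): (["I"], ["ya"]),                                   # (2)
    ("VP", "VP"): ([("V", 1), ("NP", 2)], [("NP", 2), ("V", 1)]),    # (3)
    ("V", "V"): (["fed"], ["kormil"]),                               # (4)
    ("NP", "NP"): ([("D", 1), ("N", 2)], [("N", 2)]),                # (5)
    ("D", None): (["the"], None),                                    # (6)
    ("N", "N"): (["cat"], ["kota"]),                                 # (7)
}


def expand(link):
    """Rewrite a link until only terminals remain; return one word list per component."""
    rhs = PRODUCTIONS[link]
    # Nonterminals that share an index form one child link and are rewritten together.
    children = {}
    for dim, component in enumerate(rhs):
        for symbol in component or []:
            if isinstance(symbol, tuple):
                name, index = symbol
                children.setdefault(index, [None] * len(rhs))[dim] = name
    yields = {index: expand(tuple(names)) for index, names in children.items()}
    result = []
    for dim, component in enumerate(rhs):
        if component is None:          # inactive component stays the empty tuple
            result.append(None)
            continue
        words = []
        for symbol in component:
            if isinstance(symbol, tuple):
                words.extend(yields[symbol[1]][dim])
            else:
                words.append(symbol)
        result.append(words)
    return tuple(result)


english, russian = expand(("S", "S"))
print(" ".join(english))
print(" ".join(russian))
```

Note that the link `("D", None)` is rewritten in the English component only, which is how the determiner disappears from the Russian side without any empty nonterminal.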
GMTG’s implementation of independent rewrit-
ing through the empty tuple () serves a very differ-
ent function from the empty string. Consider the
following GMTG:
$[(S), (S)] \rightarrow [(a), (\varepsilon)]$ (8)
$[(S), (S)] \rightarrow [(X^{1}), (Y^{2})]$ (9)
$[(X), ()] \rightarrow [(b), ()] \mid [(c), ()] \mid [(d), ()]$ (10)
$[(), (Y)] \rightarrow [(), (e)] \mid [(), (f)] \mid [(), (g)]$ (11)
1We write production components both side by side and one
above another to save space, but each component is always in
parentheses.
Production (8) asserts that symbol $a$ vanishes in
translation. Its application removes both of the
nonterminals on the left-hand side (LHS), pre-empting
any other production. In contrast, Production (9)
explicitly relaxes the synchronization constraint, so
that the two components can be rewritten independently.
The other six productions make assertions
about only one component and are agnostic about
the other component. Incidentally, generating the
same language with only fully synchronized productions
would raise the number of required productions
to 11, so independent rewriting also helps
to reduce grammar size.
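One way to arrive at the count of 11 is sketched below in Python. The terminal names are placeholders chosen for illustration, and the sketch assumes that the fully synchronized grammar keeps Production (8), a synchronized analogue of Production (9), and one production for every pairing of a first-component terminal with a second-component terminal.

```python
from itertools import product

FIRST_ONLY = ["b", "c", "d"]    # placeholder terminals of the first component
SECOND_ONLY = ["e", "f", "g"]   # placeholder terminals of the second component

# With independent rewriting: (8), (9), and one one-sided production per terminal.
independent_rules = 1 + 1 + len(FIRST_ONLY) + len(SECOND_ONLY)

# Fully synchronized (assumed layout): (8), an analogue of (9),
# and one production per (first, second) terminal pair.
pairs = list(product(FIRST_ONLY, SECOND_ONLY))
synchronized_rules = 1 + 1 + len(pairs)

# The generated relation is the same in both cases: (a, empty string) plus all pairs.
language = [("a", "")] + pairs

print(independent_rules, synchronized_rules, len(language))
```

The gap widens multiplicatively: with m one-sided alternatives in one component and n in the other, the independent grammar needs m + n such productions while the synchronized one needs m * n.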
Independent rewriting is also useful for model-
ing paraphrasing. Take, for example, [(Tim got a
pink slip), (Tim got laid off )]. While the two sen-
tences have the same meaning, the objects of their
verb phrases are structured very differently. GMTG
can express their relationships as follows:
$[(\text{S}), (\text{S})] \rightarrow [(\text{NP}^{1}\ \text{VP}^{2}), (\text{NP}^{1}\ \text{VP}^{2})]$ (12)
$[(\text{VP}), (\text{VP})] \rightarrow [(\text{V}^{1}\ \text{NP}^{2}), (\text{V}^{1}\ \text{PP}^{2})]$ (13)
$[(\text{NP}), (\text{PP})] \rightarrow [(\text{DT}^{1}\ \text{A}^{2}\ \text{N}^{3}), (\text{VB}^{4}\ \text{R}^{5})]$ (14)
$[(\text{NP}), (\text{NP})] \rightarrow [(\text{Tim}), (\text{Tim})]$ (15)
$[(\text{V}), (\text{V})] \rightarrow [(\text{got}), (\text{got})]$ (16)
$[(\text{DT}), ()] \rightarrow [(\text{a}), ()]$ (17)
$[(\text{A}), ()] \rightarrow [(\text{pink}), ()]$ (18)
$[(\text{N}), ()] \rightarrow [(\text{slip}), ()]$ (19)
$[(), (\text{VB})] \rightarrow [(), (\text{laid})]$ (20)
$[(), (\text{R})] \rightarrow [(), (\text{off})]$ (21)
As described by Melamed (2003), MTG requires
production components to be contiguous, except after
binarization. GMTG removes this restriction.
Take, for example, the sentence pair [(The doctor
treats his teeth), (El médico le examino los dientes)]
(Dras and Bleam, 2000). The Spanish clitic le and
the NP los dientes should both be paired with the
English NP his teeth, giving rise to a discontinuous
constituent in the Spanish component. A GMTG
fragment for the sentence is shown below:
$[(\text{S}), (\text{S})] \rightarrow [(\text{NP}^{1}\ \text{VP}^{2}), (\text{NP}^{1}\ \text{VP}^{2})]$
$[(\text{VP}), (\text{VP})] \rightarrow [(\text{V}^{1}\ \text{NP}^{2}), (\text{NP}^{2}\ \text{V}^{1}\ \text{NP}^{2})]$
$[(\text{NP}), (\text{NP})] \rightarrow [(\text{The doctor}), (\text{El médico})]$
$[(\text{V}), (\text{V})] \rightarrow [(\text{treats}), (\text{examino})]$
$[(\text{NP}), (\text{NP}, \text{NP})] \rightarrow [(\text{his teeth}), (\text{le}, \text{los dientes})]$
Note the discontinuity between le and los dientes.
Such discontinuities are marked by commas on both
the LHS and the RHS of the relevant component.
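The role of the commas can be made concrete with a small Python sketch of the fragment above. The encoding is an assumption for illustration: every component is a list of strings (one per tuple position), a nonterminal occurrence is a `(name, index)` pair, and a link is keyed by the nonterminal names in each component, so that [(NP), (NP)] and [(NP), (NP, NP)] are different links. The helper `expand_gmtg` is hypothetical and assumes one production per link.

```python
PRODUCTIONS = {
    (("S",), ("S",)): (
        [[("NP", 1), ("VP", 2)]],
        [[("NP", 1), ("VP", 2)]],
    ),
    (("VP",), ("VP",)): (
        [[("V", 1), ("NP", 2)]],
        [[("NP", 2), ("V", 1), ("NP", 2)]],
    ),
    (("NP",), ("NP",)): ([["The", "doctor"]], [["El", "médico"]]),
    (("V",), ("V",)): ([["treats"]], [["examino"]]),
    # Fan-out 2 in Spanish: the yield is the pair (le, los dientes).
    (("NP",), ("NP", "NP")): ([["his", "teeth"]], [["le"], ["los", "dientes"]]),
}


def expand_gmtg(link):
    """Return, for each component, the tuple of terminal strings that the link yields."""
    rhs = PRODUCTIONS[link]
    # Occurrences that share an index form one child link, in left-to-right order.
    children = {}
    for dim, component in enumerate(rhs):
        for string in component:
            for symbol in string:
                if isinstance(symbol, tuple):
                    name, index = symbol
                    slots = children.setdefault(index, [[] for _ in rhs])
                    slots[dim].append(name)
    yields = {
        index: expand_gmtg(tuple(tuple(names) for names in slots))
        for index, slots in children.items()
    }
    result = []
    for dim, component in enumerate(rhs):
        used = {}  # index -> number of occurrences already filled in this component
        out_component = []
        for string in component:
            words = []
            for symbol in string:
                if isinstance(symbol, tuple):
                    index = symbol[1]
                    k = used.get(index, 0)
                    # The k-th occurrence receives the k-th string of the child's yield.
                    words.extend(yields[index][dim][k])
                    used[index] = k + 1
                else:
                    words.append(symbol)
            out_component.append(words)
        result.append(out_component)
    return result


english, spanish = expand_gmtg((("S",), ("S",)))
print(" ".join(english[0]))
print(" ".join(spanish[0]))
```

The two occurrences of NP with index 2 in the Spanish VP component are filled from the two strings of one discontinuous constituent, which places the verb between le and los dientes.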
GMTG’s flexibility allows it to deal with many
complex syntactic phenomena. For example,
Becker et al. (1991) point out that TAG does not
have the generative capacity to model certain kinds
of scrambling in German, when the so-called
“co-occurrence constraint” is imposed, requiring the
derivational pairing between verbs and their complements.
They examine the English/German sentence
fragment [(... that the detective has promised
the client to indict the suspect of the crime), (...
daß des Verbrechens der Detektiv den Verdächtigen
dem Klienten zu überführen versprochen hat)]. The
verbs versprochen and überführen both have two
noun phrases as arguments. In German, these noun
phrases can appear to the left of the verbs in any
order. The following is a GMTG fragment for the
above sentence pair2:
$[(\text{S}), (\text{S})] \rightarrow [(\text{N}^{1}_{\text{det}}\ \text{has promised}\ \text{N}^{2}_{\text{client}}\ \bar{\text{S}}^{3}),\ (\bar{\text{S}}^{3}\ \text{N}^{1}_{\text{Det}}\ \bar{\text{S}}^{3}\ \text{N}^{2}_{\text{Klien}}\ \bar{\text{S}}^{3}\ \text{versprochen hat})]$ (22)
$[(\bar{\text{S}}), (\bar{\text{S}}, \bar{\text{S}}, \bar{\text{S}})] \rightarrow [(\text{to indict}\ \text{N}^{1}_{\text{suspect}}\ \text{N}^{2}_{\text{crime}}),\ (\text{N}^{2}_{\text{Verb}}, \text{N}^{1}_{\text{Verd}}, \text{zu überführen})]$ (23)
The discontinuities allow the noun arguments of
versprochen to be placed in any order with the noun
arguments of überführen. Rambow (1995) gives a
similar analysis.
3 Formal Definitions
Let $V_N$ be a finite set of nonterminal symbols and
let $\mathbb{N}$ be the set of integers.3 We define
$\mathcal{I}(V_N) = \{A^{(t)} \mid A \in V_N,\ t \in \mathbb{N}\}$.4
Elements of $\mathcal{I}(V_N)$ will be called indexed nonterminal symbols. In
what follows we also consider a finite set of terminal
symbols $V_T$, disjoint from $V_N$, and work with
strings in $V_I^*$, where $V_I = \mathcal{I}(V_N) \cup V_T$. For $\gamma \in V_I^*$,
we define $\mathrm{index}(\gamma) = \{t \mid \gamma = \gamma' A^{(t)} \gamma'',\ \gamma', \gamma'' \in V_I^*,\ A^{(t)} \in \mathcal{I}(V_N)\}$,
i.e. the set of indexes that appear in $\gamma$.
2These are only a small subset of the necessary productions.
The subscripts on the nonterminals indicate what terminals they
will eventually yield; the terminal productions have been left
out to save space.
3Any other infinite set of indexes would suit too.
4The parentheses around indexes distinguish them from
other uses of superscripts in formal language theory. However,
we shall omit the parentheses when the context is unambiguous.
An indexed tuple vector, or ITV, is a vector of
tuples of strings over $V_I$, having the form
$\gamma = [(\gamma_{11}, \ldots, \gamma_{1q_1}), \ldots, (\gamma_{D1}, \ldots, \gamma_{Dq_D})]$
where $D \geq 1$, $q_i \geq 0$ and $\gamma_{ij} \in V_I^*$ for $1 \leq i \leq D$,
$1 \leq j \leq q_i$. We write $\gamma[i]$, $1 \leq i \leq D$, to denote the
$i$-th component of $\gamma$ and $\varphi(\gamma[i])$ to denote the arity
of such a tuple, which is $q_i$. When $\varphi(\gamma[i]) = 0$,
$\gamma[i]$ is the empty tuple, written $()$. This should not
be confused with $(\varepsilon)$, that is the tuple of arity one
containing the empty string. A link is an ITV where
each $\gamma_{ij}$ consists of one indexed nonterminal and all
of these nonterminals are coindexed. As we shall
see, the notion of a link generalizes the notion of
nonterminal in context-free grammars: each production
rewrites a single link.
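The distinction between the empty tuple and the tuple holding the empty string, and the notion of a link, can be illustrated with a short Python sketch. The representation is an assumption chosen for illustration only: an ITV is a list of components, a component is a list of strings, a string is a list of symbols, and an indexed nonterminal is a `(name, index)` pair. The helper names are hypothetical.

```python
EMPTY_TUPLE = []       # the empty tuple (): arity 0
EPSILON_TUPLE = [[]]   # the tuple containing the empty string: arity 1


def arity(component):
    """Number of strings in one component of an ITV."""
    return len(component)


def index_set(itv):
    """All indexes carried by nonterminals anywhere in the ITV."""
    return {
        symbol[1]
        for component in itv
        for string in component
        for symbol in string
        if isinstance(symbol, tuple)
    }


def is_link(itv):
    """True if every string is exactly one indexed nonterminal and all share one index."""
    for component in itv:
        for string in component:
            if len(string) != 1 or not isinstance(string[0], tuple):
                return False
    return len(index_set(itv)) == 1


# A two-dimensional link whose second component has arity two.
np_link = [[[("NP", 1)]], [[("NP", 1)], [("NP", 1)]]]
# The same shape filled with terminals is an ITV but not a link.
np_yield = [[["his", "teeth"]], [["le"], ["los", "dientes"]]]

print(arity(EMPTY_TUPLE), arity(EPSILON_TUPLE))
print(is_link(np_link), is_link(np_yield))
```

A link may also have empty components, for example `[[[("D", 1)]], []]`, which corresponds to a nonterminal that exists in one dimension only.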
Definition 1 Let $D \geq 1$ be some integer constant.
A generalized multitext grammar with $D$
dimensions ($D$-GMTG for short) is a tuple
$G = (V_N, V_T, P, S)$ where $V_N$, $V_T$ are finite, disjoint sets
of nonterminal and terminal symbols, respectively,
$S \in V_N$ is the start symbol and $P$ is a finite set of
productions. Each production has the form $\alpha \rightarrow \beta$,
where $\alpha$ is a $D$-dimensional link and $\beta$ is a
$D$-dimensional ITV such that $\varphi(\alpha[i]) = \varphi(\beta[i])$ for
$1 \leq i \leq D$. If $\alpha[i]$ contains $S$, then $\varphi(\alpha[i]) = 1$.
We omit symbol $D$ from $D$-GMTG whenever it is
not relevant. To simplify notation, we write productions
as $p = [p_1, \ldots, p_D]$, with each
$p_i = (A_{i1}, \ldots, A_{iq_i}) \rightarrow (\alpha_{i1}, \ldots, \alpha_{iq_i})$, $A_{ij} \in V_N$. I.e.
we omit the unique index appearing on the LHS of
$p$. Each $p_i$ is called a production component. The
production component $() \rightarrow ()$ is called the inactive
production component. All other production components
are called active and we set
$\mathrm{active}(p) = \{i \mid q_i > 0\}$. Inactive production components are
used to relax synchronous rewriting on some dimensions,
that is to implement rewriting on $d < D$ components.
When $d = 1$, rewriting is licensed on one
component, independently of all the others.
Two grammar parameters play an important role
in this paper. Let $p = [p_1, \ldots, p_D] \in P$ and
$p_i = (A_{i1}, \ldots, A_{iq_i}) \rightarrow (\alpha_{i1}, \ldots, \alpha_{iq_i})$.
Definition 2 The rank $\rho$ of a production $p$ is
the number of links on its RHS:
$\rho(p) = |\mathrm{index}(\alpha_{11} \cdots \alpha_{1q_1} \alpha_{21} \cdots \alpha_{Dq_D})|$.
The rank of a GMTG $G$ is $\rho(G) = \max_{p \in P} \rho(p)$.
Definition 3 The fan-out of $p_i$, $p$ and $G$ are, respectively,
$\varphi(p_i) = q_i$, $\varphi(p) = \sum_{i=1}^{D} \varphi(p_i)$ and
$\varphi(G) = \max_{p \in P} \varphi(p)$.
For example, the rank of Production (23) is two and
its fan-out is four.
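The two parameters can be computed mechanically. The following Python sketch is an illustration under stated assumptions: a production is a list of components, each a pair of LHS nonterminal names and RHS strings, indexed nonterminals are `(name, index)` pairs, and the encoding of Production (23) is our reading of that production with the subscripts on N dropped. The function names are hypothetical.

```python
def rank(production):
    """Number of distinct links (indexes) on the right-hand side."""
    indexes = set()
    for _lhs, rhs in production:
        for string in rhs:
            for symbol in string:
                if isinstance(symbol, tuple):
                    indexes.add(symbol[1])
    return len(indexes)


def fan_out(production):
    """Sum over components of the arity of the left-hand-side tuple."""
    return sum(len(lhs) for lhs, _rhs in production)


# Our reading of Production (23): one English string, three German strings.
PRODUCTION_23 = [
    (["S-bar"], [["to", "indict", ("N", 1), ("N", 2)]]),
    (
        ["S-bar", "S-bar", "S-bar"],
        [[("N", 2)], [("N", 1)], ["zu", "überführen"]],
    ),
]

print(rank(PRODUCTION_23), fan_out(PRODUCTION_23))
```

The rank counts indexes rather than nonterminal occurrences, so the four occurrences of N in the two components contribute only two links.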
In GMTG, the derives relation is defined over
ITVs. GMTG derivation proceeds by synchronous
application of all the active components in some
production. The indexed nonterminals to be rewritten
simultaneously must all have the same index $t$,
and all nonterminals indexed with $t$ in the ITV must
be rewritten simultaneously. Some additional notation
will help us to define rewriting precisely. A
reindexing is a one-to-one function on $\mathbb{N}$, and is
extended to $V_I$ by letting $f(a) = a$ for $a \in V_T$
and $f(A^{(t)}) = A^{(f(t))}$ for $A^{(t)} \in \mathcal{I}(V_N)$. We
also extend $f$ to strings in $V_I^*$ analogously. We
say that $\alpha, \alpha' \in V_I^*$ are independent if
$\mathrm{index}(\alpha) \cap \mathrm{index}(\alpha') = \emptyset$.
Definition 4 Let $G = (V_N, V_T, P, S)$ be a
$D$-GMTG and let $p = [p_1, \ldots, p_D]$ with $p \in P$
and $p_i = (A_{i1}, \ldots, A_{iq_i}) \rightarrow (\alpha_{i1}, \ldots, \alpha_{iq_i})$. Let
$\gamma$ and $\delta$ be two ITVs with $\gamma[i] = (\gamma_{i1}, \ldots, \gamma_{iq_i})$ and
$\delta[i] = (\delta_{i1}, \ldots, \delta_{iq_i})$. Assume that $\alpha$ is some
concatenation of all $\alpha_{ij}$ and that $\gamma$ is some concatenation
of all $\gamma_{ij}$, $1 \leq i \leq D$, $1 \leq j \leq q_i$, and let $f$ be
some reindexing such that strings $f(\alpha)$ and $\gamma$ are
independent. The derives relation $\gamma \Rightarrow^{p}_{G} \delta$ holds
whenever there exists an index $t \in \mathbb{N}$ such that the
following two conditions are satisfied:
(i) for each $i \in \mathrm{active}(p)$ we have
$\gamma_{i1} \cdots \gamma_{iq_i} = \gamma'_{i0} A_{i1}^{(t)} \gamma'_{i1} A_{i2}^{(t)} \cdots \gamma'_{i,q_i-1} A_{iq_i}^{(t)} \gamma'_{iq_i}$
such that $t \notin \mathrm{index}(\gamma'_{i0} \gamma'_{i1} \cdots \gamma'_{iq_i})$, and each
$\delta_{ij}$ is obtained from $\gamma_{ij}$ by replacing each
$A_{ij'}^{(t)}$ with $f(\alpha_{ij'})$;
(ii) for each $i \notin \mathrm{active}(p)$ we have
$t \notin \mathrm{index}(\gamma_{i1} \cdots \gamma_{iq_i})$ and $\gamma[i] = \delta[i]$.
We generalize the $\Rightarrow^{p}_{G}$ relation to $\Rightarrow_{G}$ and $\Rightarrow^{*}_{G}$ in
the usual way, to represent derivations.
We can now introduce the notion of generated
language (or generated relation). A start link
of a $D$-GMTG is a $D$-dimensional link where at
least one component is $(S^{(1)})$, $S$ the start symbol,
and the rest of the components are $()$. Thus,
there are $2^D - 1$ start links. The language
generated by a $D$-GMTG $G$ is
$L(G) = \{\gamma \mid \gamma_S \Rightarrow^{*}_{G} \gamma,\ \gamma_S \text{ a start link},\ \gamma[i] = () \text{ or } \gamma[i] = (w_i) \text{ with } w_i \in V_T^*,\ 1 \leq i \leq D\}$.
Each ITV in $L(G)$ is called a multitext. For every $D$-GMTG $G$,
$L(G)$ can be partitioned into $2^D - 1$ subsets, each
containing multitexts derived from a different start
link. These subsets are disjoint, since every non-empty
tuple of a start link is eventually rewritten as
a string, either empty or not.5
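The count of start links follows from choosing, for each dimension, whether it is active, and excluding the choice in which no dimension is active. A minimal Python sketch, with a hypothetical function name and components rendered as display strings purely for illustration:

```python
from itertools import product


def start_links(dimensions, start_symbol="S"):
    """Enumerate every start link of a grammar with the given number of dimensions."""
    links = []
    for mask in product([True, False], repeat=dimensions):
        if any(mask):  # at least one component must hold the start symbol
            links.append(
                tuple(f"({start_symbol}^(1))" if on else "()" for on in mask)
            )
    return links


for link in start_links(2):
    print("[" + ", ".join(link) + "]")
print(len(start_links(3)))
```

For two dimensions this lists the fully synchronous start link and the two one-sided ones, which are the three subsets into which the generated language is partitioned.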
A start production is a production whose LHS
is a start link. A GMTG writer can choose the combinations
of components in which the grammar can
generate, by including start productions with the desired
combinations of active components. If a grammar
contains no start productions with a certain
combination of active components, then the corresponding
subset of $L(G)$ will be empty. Allowing
a single GMTG $G$ to generate multitexts with
some empty tuples corresponds to modeling relations
of different dimensionalities. This capability
enables a synchronous grammar to govern
lower-dimensional sublanguages/translations. For example,
an English/Italian GMTG can include Production (9),
an English CFG, and an Italian CFG. A
single GMTG can then govern both translingual
and monolingual information in applications.
Furthermore, this capability simplifies the normalization
procedure described in Section 6. Otherwise,
this procedure would require exceptions to be made
when eliminating epsilons from start productions.
5We are assuming that there are no useless nonterminals.
4 Generative Capacity
In this section we compare the generative capacity
of GMTG with that of mildly context-sensitive
grammars. We focus on LCFRS, using the notational
variant introduced by Rambow and Satta
(1999), briefly summarized below. Throughout this
section, strings $w \in V_T^*$ and vectors of the form
$[(w)]$ will be identified. For lack of space, some
proofs are only sketched, or entirely omitted when
relatively intuitive: Melamed et al. (2004) provide
more details.
Let $V_T$ be some terminal alphabet. A function $g$
has rank $r \geq 0$ if it is defined on
$(V_T^*)^{f_1} \times (V_T^*)^{f_2} \times \cdots \times (V_T^*)^{f_r}$,
for integers $f_i \geq 1$, $1 \leq i \leq r$. Also,
$g$ has fan-out $f \geq 1$ if its range is a subset of
$(V_T^*)^{f}$.
Let $y_h$, $x_{ij}$, $1 \leq h \leq f$, $1 \leq i \leq r$ and $1 \leq j \leq f_i$,
be string-valued variables. Function $g$ is linear
regular if it is defined by an equation of the form
$g(\langle x_{11}, \ldots, x_{1f_1} \rangle, \ldots, \langle x_{r1}, \ldots, x_{rf_r} \rangle) = \langle y_1, \ldots, y_f \rangle$ (24)
where $\langle y_1, \ldots, y_f \rangle$ represents some grouping into $f$
strings of all and only the variables appearing in the
left-hand side, possibly with some additional terminal
symbols. (Symbols $\rho$, $\varphi$ and $\Rightarrow_{G}$ are overloaded
below.)
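A concrete instance may help fix the shape of Equation (24). The Python sketch below uses an illustrative function of rank 2 and fan-out 2, namely g(<x11, x12>, <x21>) = <x11 a x21, b x12>. This function, the template encoding with variables as `(i, j)` pairs, and the helper names are all assumptions made for illustration.

```python
def is_linear_regular(template, arities):
    """Check that each variable x_ij, 1 <= j <= arities[i-1], occurs exactly once."""
    expected = {
        (i, j)
        for i, f_i in enumerate(arities, start=1)
        for j in range(1, f_i + 1)
    }
    seen = [s for string in template for s in string if isinstance(s, tuple)]
    return len(seen) == len(set(seen)) and set(seen) == expected


def apply(template, args):
    """Evaluate the function on a list of argument tuples of strings."""
    return tuple(
        "".join(args[s[0] - 1][s[1] - 1] if isinstance(s, tuple) else s for s in string)
        for string in template
    )


# g(<x11, x12>, <x21>) = <x11 a x21, b x12>
G_TEMPLATE = [[(1, 1), "a", (2, 1)], ["b", (1, 2)]]

print(is_linear_regular(G_TEMPLATE, [2, 1]))
print(apply(G_TEMPLATE, [("u", "v"), ("w",)]))
```

A template that copies or drops a variable, such as `[[(1, 1), (1, 1)]]` for a single argument of arity one, fails the check, which is what separates linear regular functions from arbitrary string rewriting.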
Definition 5 A Linear Context-Free Rewriting
System (LCFRS) is a quadruple
$G = (V_N, V_T, P, S)$ where $V_N$, $V_T$ and $S$ are
as in GMTGs, every $A \in V_N$ is associated
with an integer $\varphi(A) \geq 1$ with $\varphi(S) = 1$,
and $P$ is a finite set of productions of the form
$A \rightarrow g(B_1, B_2, \ldots, B_{\rho(g)})$, where $\rho(g) \geq 0$,
$A, B_i \in V_N$, $1 \leq i \leq \rho(g)$ and where $g$ is a linear
regular function having rank $\rho(g)$ and fan-out
$\varphi(A)$, defined on $(V_T^*)^{\varphi(B_1)} \times \cdots \times (V_T^*)^{\varphi(B_{\rho(g)})}$.
For every $A \in V_N$ and $\tau \in (V_T^*)^{\varphi(A)}$, we write
$A \Rightarrow_{G} \tau$ if
(i) $A \rightarrow g() \in P$ and $g() = \tau$; or else
(ii) $A \rightarrow g(B_1, \ldots, B_{\rho(g)}) \in P$,
$B_i \Rightarrow_{G} \tau_i \in (V_T^*)^{\varphi(B_i)}$ for every
$1 \leq i \leq \rho(g)$, and $g(\tau_1, \ldots, \tau_{\rho(g)}) = \tau$.
The language generated by $G$ is defined as
$L(G) = \{w \mid S \Rightarrow_{G} (w),\ w \in V_T^*\}$. Let $p \in P$,
$p = A \rightarrow g(B_1, B_2, \ldots, B_{\rho(g)})$. The rank of $p$
and $G$ are, respectively, $\rho(p) = \rho(g)$ and
$\rho(G) = \max_{p \in P} \rho(p)$. The fan-out of $p$ and $G$ are, respectively,
$\varphi(p) = \varphi(A)$ and $\varphi(G) = \max_{p \in P} \varphi(p)$.
The proof of the following theorem is relatively
intuitive and therefore omitted.
Theorem 1 For any LCFRS $G$, there exists some
1-GMTG $G'$ with $\rho(G') = \rho(G)$ and $\varphi(G') = \varphi(G)$
such that $L(G') = L(G)$.
Next, we show that the generative capacity of
GMTG does not exceed that of LCFRS. In order
to compare string tuples with bare strings, we
introduce two special functions ranging over multitexts.
Assume two fresh symbols $\$, \# \notin (V_T \cup V_N)$.
For a multitext $\gamma$ we write $\mathrm{pad}(\gamma) = \gamma'$,
where $\gamma'[i] = (\#)$ if $\gamma[i] = ()$ and
$\gamma'[i] = \gamma[i]$ otherwise, $1 \leq i \leq D$. For
a multitext $[(w_1), (w_2), \ldots, (w_D)]$ with no empty
tuple, we write $\mathrm{glue}([(w_1), (w_2), \ldots, (w_D)]) = w_1 \$ w_2 \$ \cdots \$ w_D$.
We extend both functions to
sets of multitexts in the obvious way:
$\mathrm{glue}(L) = \{\mathrm{glue}(\gamma) \mid \gamma \in L\}$ and
$\mathrm{pad}(L) = \{\mathrm{pad}(\gamma) \mid \gamma \in L\}$.
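The two functions can be sketched in a few lines of Python. The encoding is an assumption for illustration: a multitext is a list with one entry per dimension, `None` stands for the empty tuple, any string (including the empty string) stands for a tuple of arity one, and the characters chosen for the two fresh symbols are placeholders.

```python
PAD_SYMBOL = "#"    # placeholder for the fresh symbol that fills empty tuples
GLUE_SYMBOL = "$"   # placeholder for the fresh symbol that separates components


def pad(multitext):
    """Replace every empty tuple by the tuple holding the pad symbol."""
    return [PAD_SYMBOL if component is None else component for component in multitext]


def glue(multitext):
    """Join the components of a multitext without empty tuples into one string."""
    if any(component is None for component in multitext):
        raise ValueError("glue is only defined when no component is the empty tuple")
    return GLUE_SYMBOL.join(multitext)


one_sided = ["ab", None]        # second component is the empty tuple
print(pad(one_sided))
print(glue(pad(one_sided)))
print(glue(["", "x"]))          # the empty string is a legitimate component
```

Padding before gluing keeps a multitext with an inactive component distinguishable from one whose component is the empty string, since the two map to different glued strings.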
In a $D$-GMTG, a production with $d$ active components,
$1 \leq d \leq D$, is said to be $d$-active. A
$D$-GMTG whose start productions are all $D$-active
is called properly synchronous.
Lemma 1 For any properly synchronous $D$-GMTG
$G$, there exists some LCFRS $G'$ with $\rho(G') = \rho(G)$
and $\varphi(G') = \varphi(G)$ such that $L(G') = \mathrm{glue}(L(G))$.
Outline of the proof. We set $G' = (V'_N, V_T, P', [S])$,
where $V'_N = \{[p, t] \mid p \in P,\ t \in \mathrm{index}(G)\} \cup \{[S]\}$,
$\mathrm{index}(G)$ is the set of all indexes appearing
in the productions of $G$, and $P'$ is constructed as
follows. Let $p, p' \in P$ with $p = [p_1, \ldots, p_D]$,
$p' = [p'_1, \ldots, p'_D]$,
$p_i = (A_{i1}, \ldots, A_{iq_i}) \rightarrow (\alpha_{i1}, \ldots, \alpha_{iq_i})$, and
$p'_i = (B_{i1}, \ldots, B_{iq'_i}) \rightarrow (\beta_{i1}, \ldots, \beta_{iq'_i})$.
Assume that $p$ can rewrite the right-hand side of $p'$, that is
$[(\beta_{11}, \ldots, \beta_{1q'_1}), \ldots, (\beta_{D1}, \ldots, \beta_{Dq'_D})] \Rightarrow^{p}_{G} [(\delta_{11}, \ldots, \delta_{1q_1}), \ldots, (\delta_{D1}, \ldots, \delta_{Dq_D})].$
Then there must be at least one index $t$ such that for
each $i \in \mathrm{active}(p)$, $(\beta_{i1}, \ldots, \beta_{iq'_i})$ contains exactly
$q_i$ occurrences of $t$.
Let $\alpha_p = \alpha_{11} \cdots \alpha_{1q_1} \alpha_{21} \cdots \alpha_{Dq_D}$. Also let
$\mathrm{index}(\alpha_p) = \{t_1, \ldots, t_{\rho(p)}\}$ and let $\varphi(t_i)$ be the
number of occurrences of $t_i$ appearing in $\alpha_p$. We
define an alphabet $X_p = \{x_{ij} \mid 1 \leq i \leq \rho(p),\ 1 \leq j \leq \varphi(t_i)\}$.
For each $i$ and $j$ with $1 \leq i \leq D$, $i \in \mathrm{active}(p)$ and $1 \leq j \leq q_i$,
we define a string $h(p, i, j)$ over $X_p \cup V_T$ as follows.
Let $\alpha_{ij} = Y_1 Y_2 \cdots Y_c$, each $Y_k \in V_I$. Then
$h(p, i, j) = Y'_1 Y'_2 \cdots Y'_c$, where
• $Y'_k = Y_k$ in case $Y_k \in V_T$; and
• $Y'_k = x_{t,m}$ in case $Y_k \in \mathcal{I}(V_N)$, where $t$ is
the index of $Y_k$ and the indicated occurrence
of $Y_k$ is the $m$-th occurrence of such symbol
appearing from left to right in string $\alpha_p$.
Next, for every possible $p$, $p'$, and $t$ as above, we
add to $P'$ a production
$p_t = [p', t] \rightarrow g([p, t_1], \ldots, [p, t_{\rho(p)}]),$
where
$g(\langle x_{11}, \ldots, x_{1\varphi(t_1)} \rangle, \ldots, \langle x_{\rho(p)1}, \ldots, x_{\rho(p)\varphi(t_{\rho(p)})} \rangle) = \langle h(p, 1, 1), \ldots, h(p, D, q_D) \rangle$
(each $h(p, i, j)$ above satisfies $i \in \mathrm{active}(p)$). Note
that $g$ is a function with rank $\rho(p)$ and fan-out
$\sum_{i=1}^{D} q_i = \varphi(p)$. Thus we have $\rho(p_t) = \rho(p)$
and $\varphi(p_t) = \varphi(p)$. Without loss of generality,
we assume that $G$ contains only one production
with $S$ appearing on the left-hand side, having the
form $p_S = [(S), \ldots, (S)] \rightarrow [(A^{1}), \ldots, (A^{1})]$.
To complete the construction of $P'$, we then
add a last production $[S] \rightarrow g([p_S, 1])$ where
$g(\langle x_{11}, x_{12}, \ldots, x_{1D} \rangle) = \langle x_{11} \$ x_{12} \$ \cdots \$ x_{1D} \rangle$.
We claim that, for each $p$, $p'$ and $t$ as above
$[(A^{1}_{11}, \ldots, A^{1}_{1q_1}), \ldots, (A^{1}_{D1}, \ldots, A^{1}_{Dq_D})] \Rightarrow^{*}_{G} [(w_{11}, \ldots, w_{1q_1}), \ldots, (w_{D1}, \ldots, w_{Dq_D})]$
iff $[p', t] \Rightarrow_{G'} \langle w_{11}, \ldots, w_{1q_1}, w_{21}, \ldots, w_{Dq_D} \rangle$. The
lemma follows from this claim.
The proof of the next lemma is relatively intuitive
and therefore omitted.
Lemma 2 For any $D$-GMTG $G$, there exists a properly
synchronous $D$-GMTG $G'$ such that $\rho(G') = \rho(G)$,
$\varphi(G') = \max\{\varphi(G), D\}$, and $L(G') = \mathrm{pad}(L(G))$.
Combining Lemmas 1 and 2, we have
Theorem 2 For any $D$-GMTG $G$, there exists
some LCFRS $G'$ with $\rho(G') = \rho(G)$ and
$\varphi(G') = \max\{\varphi(G), D\}$ such that
$L(G') = \mathrm{glue}(\mathrm{pad}(L(G)))$.
5 Weak Language Preservation Property
GMTGs have the weak language preservation property, which is one of the
defining requirements of synchronous rewriting systems (Rambow and Satta,
1996). Informally stated, the generative capacity of the class of all
component grammars of a GMTG exactly corresponds to the class of all
projected languages. In other words, the interaction among different
grammar components in the rewriting process of GMTG does not increase the
generative power beyond the above-mentioned class. The next result states
this property more formally.
Let $G$ be a $D$-GMTG with production set $P$. For $1 \le d \le D$, the
$d$-th component grammar of $G$, written $\mathrm{proj}(G, d)$, is the
1-GMTG with productions
$P_d = \{ p_d \mid [p_1, \ldots, p_D] \in P,\ p_d \ne () \rightarrow () \}$.
Similarly, the $d$-th projected language of $L(G)$ is
$\mathrm{proj}(L(G), d) = \{ w_d \mid [(w_1), \ldots, (w_D)] \in L(G),\ (w_d) \ne () \}$.
In general $L(\mathrm{proj}(G, d)) \ne \mathrm{proj}(L(G), d)$, because
component grammars $\mathrm{proj}(G, d)$ interact with each other in the
rewriting process of $G$. To give a simple example, consider the 2-GMTG $G$
with productions $[(S),(S)] \rightarrow [(\varepsilon),(\varepsilon)]$,
$[(S),(S)] \rightarrow [(a A^{(1)}),(a S^{(1)})]$ and
$[(A),(S)] \rightarrow [(S^{(1)}),(S^{(1)} b)]$. Then
$L(G) = \{ [(a^m),(a^m b^m)] \mid m \ge 0 \}$, and thus
$\mathrm{proj}(L(G), 2) = \{ a^m b^m \mid m \ge 0 \}$. On the other hand,
$L(\mathrm{proj}(G, 2)) = \{ a^m b^n \mid m, n \ge 0 \}$. Let a12
a5 LCFRS
a9 be the class of all lan-
guages generated by LCFRSs. Also let a12 a154 a47 a10 a51 and
a12 a154
a47a14a13a91a51 be the classes of languages
a24
a5
a4a5a0a6a2
a4 a5
a117 a1
a32
a9 a9 and
a4a5a0a6a2
a4 a5
a24
a5
a117 a9a11a1
a32
a9 a9 , respectively, for every a17a124a93 a95 , ev-
ery a17 -GMTG a117 and every a32 with a95a83a105 a32 a105 a17 .
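The gap between the projected language and the language of the component grammar can be checked by brute force on the small 2-GMTG example given above, whose productions are [(S),(S)] → [(ε),(ε)], [(S),(S)] → [(a A1),(a S1)] and [(A),(S)] → [(S1),(S1 b)]. The following is a minimal sketch, not a general GMTG interpreter: it hard-codes these three productions, and the function names `multitexts` and `component_language` are hypothetical.

```python
def multitexts(max_len):
    """Enumerate multitexts of the example 2-GMTG whose second
    component has at most max_len symbols."""
    done = set()
    # A sentential form is (comp1, left2, right2, link): component 1 is
    # comp1 followed by the active link, component 2 is left2 + link + right2.
    frontier = [("", "", "", ("S", "S"))]
    while frontier:
        comp1, left2, right2, link = frontier.pop()
        if len(left2) + len(right2) > max_len:
            continue
        if link == ("S", "S"):
            # [(S),(S)] -> [(eps),(eps)]
            done.add((comp1, left2 + right2))
            # [(S),(S)] -> [(a A1),(a S1)]
            frontier.append((comp1 + "a", left2 + "a", right2, ("A", "S")))
        else:
            # [(A),(S)] -> [(S1),(S1 b)]
            frontier.append((comp1, left2, "b" + right2, ("S", "S")))
    return done


def component_language(max_len):
    """Enumerate strings of the second component grammar taken alone:
    S -> eps, S -> a S, S -> S b (no synchronization with component 1)."""
    done, seen = set(), set()
    frontier = [("", "")]  # terminals to the left and right of the single S
    while frontier:
        left, right = frontier.pop()
        if len(left) + len(right) > max_len or (left, right) in seen:
            continue
        seen.add((left, right))
        done.add(left + right)                # S -> eps
        frontier.append((left + "a", right))  # S -> a S
        frontier.append((left, "b" + right))  # S -> S b
    return done


projected = {second for _first, second in multitexts(4)}
standalone = component_language(4)
print(sorted(projected))               # ['', 'aabb', 'ab']
print(sorted(standalone - projected))  # strings such as 'a', 'b', 'aab'
```

Synchronization with the first component is what forces the counts of a and b to match; once the second component is rewritten in isolation, that constraint disappears.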
Theorem 3 a12 a154 a47 a10 a51 a42 a12 a5 a24a16a15a18a17a20a19 a18 a9 and a12 a154 a47a21a13a102a51 a42
a12
a5
a24a22a15a18a17a23a19
a18
a9 .
Proof. The a24 cases directly follow from Theo-
rem 1.
Let a117 be some a17 -GMTG and let a32 be an integer
such that a95 a105 a32 a105 a17 . It is not difficult to see that
a5a7a6a10a8a146a72
a5
a4a146a131a146a70
a5
a24
a5
a4a5a0a6a2
a4 a5
a117 a1
a32
a9 a9 a9 a9a137a42a9a24
a5
a4a5a0a6a2
a4 a5
a117 a1
a32
a9 a9 . Hence
a24
a5
a4a5a0a6a2
a4 a5
a117 a1
a32
a9 a9 can be generated by some LCFRS, by
Theorem 2.
We now define a LCFRS a117 a79 such that
a24
a5
a117 a79a14a9a139a42 a4a5a0a6a2
a4 a5
a4a135a131a71a70
a5
a24
a5
a117 a9 a9a11a1
a32
a9 a9 . Assume a117 a79 a79a125a42
a5
a37a39a38 a1a75a37 a58 a1a6a119 a1
a18
a9 is a properly synchronous a17 -GMTG
generating a4a146a131a146a70 a5 a24 a5 a117 a9 a9 (Lemma 2). Let a117a83a79a166a42
a5
a37 a79
a38
a1a75a37a59a58 a1a6a119 a79 a1
a0a18
a2a24a9 , where a37 a79
a38
and a119 a79 are constructed
from a117 a79 a79 almost as in the proof of Lemma 1.
The only difference is in the definition of strings
a44
a5
a127 a1a89a106 a1a2a108 a9 and the production rewriting
a0a18
a2 , speci-
fied as follows (we use the same notation as in the
proof of Lemma 1). a44
a5
a127 a1a89a106 a1a2a108 a9a76a42
a25
a79
a7
a25
a79
a8
a148a14a148a14a148
a25
a79
a14 , where
for each a25 : (i) a25 a79
a16
a42
a25a15a16 if a25a15a16a101a52
a37a91a58 and a106 a42
a32 ;
(ii) a25 a79a16 a42 a22 if a25a15a16 a52 a37a91a58 and a106 a15a42 a32 ; (iii) a25 a79a16 a42 a43 a49a12a20a21
if a25a18a16a123a52 a41 a5 a37 a38 a9 , with a54 , a22 as in the original proof.
Finally, the production rewriting a0a18 a2 has the form
a0a18
a2 a3
a36 a5 a0
a127a28a27 a1a16a95 a2a24a9 , where a36 a5a48a47 a43
a7 a7
a1a48a43
a7a13a8
a1a14a86a14a86a14a86a10a1a48a43
a7
a90
a49
a9a61a42
a47
a43
a7 a7
a43
a7a13a8a33a148a14a148a14a148
a43
a7
a90
a49 . To conclude the proof, note that
a4a5a0a6a2
a4 a5
a24
a5
a117 a9a11a1
a32
a9 a9 and a4a5a0a6a2
a4 a5
a4a146a131a146a70
a5
a24
a5
a117 a9 a9a11a1
a32
a9 a9 can differ
only with respect to string a2 . The theorem then fol-
lows from the fact that LCFRS is closed under in-
tersection with regular languages (Weir, 1988).
6 Generalized Chomsky Normal Form
Certain kinds of text analysis require a grammar in a convenient normal
form. The prototypical example for CFG is Chomsky Normal Form (CNF), which
is required for CKY-style parsing. A $D$-GMTG is in Generalized Chomsky
Normal Form (GCNF) if it has no useless links or useless terminals, and
every production is in one of two forms:
(i) A nonterminal production has rank = 2 and no terminals or
$\varepsilon$'s on the RHS.
(ii) A terminal production has exactly one component of the form
$A \rightarrow a$, where $A \in V_N$ and $a \in V_T$. The other components
are inactive.
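The two production forms can be tested mechanically. The sketch below assumes a hypothetical encoding that is not part of the paper: a production is a list of components, each component is a pair `(lhs, rhs)`, `lhs` is a tuple of nonterminal names (empty when the component is inactive), `rhs` is a tuple of strings, and each string is a tuple of symbols. A linked nonterminal is a `(name, index)` pair and a terminal is a plain Python string. The grammar-level condition (no useless links or terminals) is not checked here.

```python
def rhs_symbols(production):
    """Yield every symbol on the RHS of every component."""
    for _lhs, rhs in production:
        for string in rhs:
            yield from string


def rank(production):
    """Number of distinct links (co-indexed nonterminals) on the RHS."""
    return len({sym[1] for sym in rhs_symbols(production)
                if isinstance(sym, tuple)})


def is_nonterminal_production(production):
    """Form (i): rank 2, and only linked nonterminals on the RHS."""
    return rank(production) == 2 and all(
        isinstance(sym, tuple) for sym in rhs_symbols(production))


def is_terminal_production(production, terminals):
    """Form (ii): one active component A -> a, all others inactive."""
    active = [(lhs, rhs) for lhs, rhs in production if lhs or rhs]
    if len(active) != 1:
        return False
    lhs, rhs = active[0]
    return (len(lhs) == 1 and len(rhs) == 1 and len(rhs[0]) == 1
            and rhs[0][0] in terminals)


def is_gcnf_production(production, terminals):
    return (is_nonterminal_production(production)
            or is_terminal_production(production, terminals))


# A rank-2 nonterminal production with fan-out 1 and 2 in its components.
binary = [(("S",), ((("N", 1), ("VP", 2)),)),
          (("S",), ((("VP", 2), ("N", 1), ("VP", 2)),))]
# A terminal production whose second component is inactive.
lexical = [(("N",), (("Pat",),)), ((), ())]
# Mixing terminals and nonterminals on the RHS violates both forms.
mixed = [(("S",), ((("N", 1), "Pat", ("VP", 2)),)),
         (("S",), ((("VP", 2), ("N", 1), "Pat", ("VP", 2)),))]

print(is_gcnf_production(binary, {"Pat"}))   # True
print(is_gcnf_production(lexical, {"Pat"}))  # True
print(is_gcnf_production(mixed, {"Pat"}))    # False
```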
The algorithm to convert a GMTG to GCNF has the following steps: (1) add a
new start-symbol, (2) isolate terminals, (3) binarize productions, (4)
remove $\varepsilon$'s, (5) eliminate useless links and terminals, and (6)
eliminate unit productions. The steps are generalizations of those
presented by Hopcroft et al. (2001) to the multidimensional case with
discontinuities. The ordering of these steps is important, as some steps
can restore conditions that others eliminate. Traditionally, the terminal
isolation and binarization steps came last, but the alternative order
reduces the number of productions that can be created during
$\varepsilon$-elimination. Steps (1), (2), (5) and (6) are the same for CFG
and GMTG, except that the notion of nonterminal in CFG is replaced with
links in GMTG. Some complications arise, however, in the generalization of
steps (3) and (4).
6.1 Step 3: Binarize
The third step of converting to GCNF is binarization of the productions,
making the rank of the grammar two. For $r \ge 0$ and $f \ge 1$, we write
$D$-GMTG$_f^{(r)}$ to represent the class of all $D$-GMTGs with rank $r$
and fan-out $f$. A CFG can always be binarized into another CFG: two
adjacent nonterminals are replaced with a single nonterminal that yields
them. In contrast, it can be impossible to binarize a $D$-GMTG$_f^{(r)}$
into an equivalent $D$-GMTG$_f^{(2)}$. From results presented by Rambow and
Satta (1999) it follows that, for every fan-out $f \ge 2$ and rank
$r \ge 4$, there are some index orderings that can be generated by
$D$-GMTG$_f^{(r)}$ but not $D$-GMTG$_f^{(r-1)}$.

[Figure 1: A production that requires an increased fan-out to binarize, and
its 2D illustration. The production is
$[(S),(S)] \rightarrow [(N^{1}\,\text{Pat}\;V^{2}\,\text{went}\;P^{3}\,\text{home}\;A^{4}\,\text{early}),\ (P^{3}\,\text{damoy}\;N^{1}\,\text{Pat}\;A^{4}\,\text{rano}\;V^{2}\,\text{pashol})]$;
the illustration has the English words (Pat, went, home, early) on one axis
and the Russian words (damoy, Pat, rano, pashol) on the other.]

The distinguishing characteristic of such index orderings is apparent in
Figure 1, which shows a production in a grammar with fan-out two, and a
graph that illustrates which nonterminals are coindexed. No two
nonterminals are adjacent in both components, so replacing any two
nonterminals with a single nonterminal causes a discontinuity. Increasing
the fan-out of the grammar allows a single nonterminal to rewrite as
non-adjacent nonterminals in the same string. Increasing the fan-out can be
necessary even for binarizing a 1-GMTG production such as:
$[(S, S)] \rightarrow [(N^{1}\,V^{2}\,P^{3}\,A^{4},\ P^{3}\,N^{1}\,A^{4}\,V^{2})]$ (25)
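The claim that no two nonterminals of the Figure 1 production are adjacent in both components can be verified directly. The sketch below is only an illustration under a simplifying assumption: each component is reduced to its sequence of link labels (N, V, P, A), with terminals omitted, and the helper name `adjacent_pairs` is hypothetical.

```python
def adjacent_pairs(string):
    """Unordered pairs of links that sit next to each other in one string."""
    return {frozenset(pair) for pair in zip(string, string[1:])}


# Link order in the two components of the Figure 1 production.
english = ["N", "V", "P", "A"]  # Pat went home early
russian = ["P", "N", "A", "V"]  # damoy Pat rano pashol

shared = adjacent_pairs(english) & adjacent_pairs(russian)
print(shared)  # set(): no pair of links is adjacent in both components
```

The same check applies to production (25), where the two sequences are the two strings of a single component rather than two components.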
To binarize, we nondeterministically split each nonterminal production
$p_0$ of rank $r > 2$ into two nonterminal productions $p_1$ and $p_2$ of
rank $< r$, but possibly with higher fan-out. Since this algorithm replaces
$p_0$ with two productions that have rank $< r$, recursively applying the
algorithm to productions of rank greater than two will reduce the rank of
the grammar to two. The algorithm follows:
(i) Nondeterministically choose $m$ links to be removed from $p_0$ and
replaced with a single link to make $p_1$, where $2 \le m \le r - 1$. We
call these links the m-links.
(ii) Create a new ITV $\Psi$. Two nonterminals are neighbors if they are
adjacent in the same string in a production RHS. For each set of m-link
neighbors in component $d$ in $p_0$, place that set of neighbors into the
$d$'th component of $\Psi$ in the order in which they appeared in $p_0$, so
that each set of neighbors becomes a different string, for
$1 \le d \le D$.
(iii) Create a new unique nonterminal, say $B$, and replace each set of
neighbors in production $p_0$ with $B$, to create $p_1$. The production
$p_2$ is
$[B, \ldots, B] \rightarrow \Psi$
For example, binarization of the productions for the English/Russian
multitext [(Pat went home early), (damoy Pat rano pashol)]⁶ in Figure 1
requires that we increase the fan-out of the language to three. The
binarized productions are as follows:
$[(S),(S)] \rightarrow [(N^{1}\,\text{Pat}\;VP^{2}),\ (VP^{2}\;N^{1}\,\text{Pat}\;VP^{2})]$ (26)
$[(VP),(VP, VP)] \rightarrow [(V^{1}\;A^{2}\,\text{early}),\ (V^{1},\ A^{2}\,\text{rano}\;V^{1})]$ (27)
$[(V),(V, V)] \rightarrow [(V^{1}\,\text{went}\;P^{2}\,\text{home}),\ (P^{2}\,\text{damoy},\ V^{1}\,\text{pashol})]$ (28)
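Step (ii) of the binarization algorithm, grouping the chosen m-links into sets of neighbors, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: strings are lists of link labels with terminals already isolated (step 2), and the function name `neighbor_sets` is hypothetical. Each maximal run of adjacent m-links becomes one string of the new ITV, so the number of runs gives the fan-out of the new link.

```python
def neighbor_sets(string, m_links):
    """Split the m-links of one string into maximal runs of adjacent links."""
    runs, current = [], []
    for symbol in string:
        if symbol in m_links:
            current.append(symbol)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


# Link order in the Figure 1 production, one string per component.
english = ["N", "V", "P", "A"]
russian = ["P", "N", "A", "V"]

# Merge V, P and A into a single new link (called VP in production (26)).
m_links = {"V", "P", "A"}
itv = [neighbor_sets(english, m_links), neighbor_sets(russian, m_links)]
print(itv)                                    # [[['V', 'P', 'A']], [['P'], ['A', 'V']]]
print(sum(len(component) for component in itv))  # 3
```

The new link has one string in the English component and two in the Russian component, which matches the left-hand side [(VP),(VP, VP)] of production (27) and the total fan-out of three.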
6.2 Step 4: Eliminate ε's
Grammars in GCNF cannot have $\varepsilon$'s in their productions. Thus,
GCNF is a more restrictive normal form than those used by Wu (1997) and
Melamed (2003). The absence of $\varepsilon$'s simplifies parsers for GMTG
(Melamed, 2004). Given a GMTG $G$ with $\varepsilon$ in some productions,
we give the construction of a weakly equivalent grammar $G'$ without any
$\varepsilon$'s. First, determine all nullable links and associated strings
in $G$. A link
$\Lambda = [(A_1, \ldots, A_1), \ldots, (A_D, \ldots, A_D)]$
is nullable if $\Lambda \Rightarrow^{*} \Psi$, where
$\Psi = [(\alpha_{11}, \ldots, \alpha_{1 q_1}), \ldots, (\alpha_{D1}, \ldots, \alpha_{D q_D})]$
is an ITV where at least one $\alpha_{di}$ is $\varepsilon$. We say the
link $\Lambda$ is nullable and the string at address $(d, i)$ in $\Lambda$
is nullable. For each nullable link, we create $2^m$ versions of the link,
where $m$ is the number of nullable strings of that link. There is one
version for each of the possible combinations of the nullable strings being
present or absent. The version of the link with all strings present is its
original version. Each non-original version of the link (except in the case
of start links) gets a unique subscript, which is applied to all the
nonterminals in the link, so that each link is unique in the grammar. We
construct a new grammar $G'$ whose set of productions $P'$ is determined as
follows: for each production, we identify the nullable links on the RHS and
replace them with each combination of the non-original versions found
earlier. If a string is left empty during this process, that string is
removed from the RHS and the fan-out of the production component is reduced
by one. The link on the LHS is replaced with its appropriate matching
non-original link. There is one exception to the replacements. If a
production consists of all nullable strings, do not include this case.
Lastly, we remove all strings on the RHS of productions that have
$\varepsilon$'s, and reduce the fan-out of the productions accordingly.
Once again, we replace the LHS link with the appropriate version.

⁶ The Russian is topicalized but grammatically correct.
Consider the example grammar:
$[(S),(S)] \rightarrow [(A^{1}\,B^{2}\,A^{1}),\ (B^{2}\,A^{1})]$ (29)
$[(A, A),(A)] \rightarrow [(a,\ B^{1}),\ (B^{1})]$ (30)
$[(B),(B)] \rightarrow [(b),\ (\varepsilon)]$ (31)
$[(B),(B)] \rightarrow [(b),\ (bb)]$ (32)
We first identify which links are nullable. In this case $[(A, A),(A)]$ and
$[(B),(B)]$ are nullable so we create a new version of both links:
$[(A_1, A_1),()]$ and $[(B_1),()]$. We then alter the productions.
Production (31) gets replaced by (40). A new production based on (30) is
Production (38). Lastly, Production (29) has two nullable strings on the
RHS, so it gets altered to add three new productions, (34), (35) and (36).
The altered set of productions is the following:
$[(S),(S)] \rightarrow [(A^{1}\,B^{2}\,A^{1}),\ (B^{2}\,A^{1})]$ (33)
$[(S),(S)] \rightarrow [(A^{1}\,B_1^{2}\,A^{1}),\ (A^{1})]$ (34)
$[(S),(S)] \rightarrow [(A_1^{1}\,B^{2}\,A_1^{1}),\ (B^{2})]$ (35)
$[(S),()] \rightarrow [(A_1^{1}\,B_1^{2}\,A_1^{1}),\ ()]$ (36)
$[(A, A),(A)] \rightarrow [(a,\ B^{1}),\ (B^{1})]$ (37)
$[(A_1, A_1),()] \rightarrow [(a,\ B_1^{1}),\ ()]$ (38)
$[(B),(B)] \rightarrow [(b),\ (bb)]$ (39)
$[(B_1),()] \rightarrow [(b),\ ()]$ (40)
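The version count in the construction can be illustrated with a short enumeration. This is a minimal sketch under an assumed encoding, not the full elimination procedure: a nullable string is identified by a hypothetical `(component, string)` address, and `link_versions` is an illustrative name. Each version records which nullable strings are kept, so a link with m nullable strings has 2^m versions.

```python
from itertools import product


def link_versions(nullable_addresses):
    """All present/absent assignments for the nullable strings of one link.

    Addresses are (component, string) pairs; True means the string is kept.
    The first assignment keeps every string, i.e. it is the original link.
    """
    addresses = sorted(nullable_addresses)
    return [dict(zip(addresses, keep))
            for keep in product([True, False], repeat=len(addresses))]


# In the example grammar, each nullable link has exactly one nullable
# string, located in the second component.
a_versions = link_versions({(2, 1)})  # [(A,A),(A)] and [(A_1,A_1),()]
b_versions = link_versions({(2, 1)})  # [(B),(B)] and [(B_1),()]

# Production (29) mentions both links, so it expands into 2 * 2 variants.
variants = list(product(a_versions, b_versions))
print(len(variants))  # 4, one per production (33)-(36)
```

The first variant keeps both links in their original version and corresponds to production (33); the remaining three correspond to the added productions (34), (35) and (36).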
Melamed et al. (2004) give more details about
conversion to GCNF, as well as the full proof of our
final theorem:
Theorem 4 For each GMTG $G$ there exists a GMTG $G'$ in GCNF generating the
same set of multitexts as $G$ but with each $(\varepsilon)$ component in a
multitext replaced by $()$.
7 Conclusions
Generalized Multitext Grammar is a convenient and
intuitive model of parallel text. In this paper, we
have presented some formal properties of GMTG,
including proofs that the generative capacity of
GMTG is comparable to ordinary LCFRS, and that
GMTG has the weak language preservation prop-
erty. We also proposed a synchronous generaliza-
tion of Chomsky Normal Form, laying the founda-
tion for synchronous CKY parsing under GMTG. In
future work, we shall explore the empirical proper-
ties of GMTG, by inducing stochastic GMTGs from
real multitexts.
Acknowledgments
Thanks to Owen Rambow and the anonymous re-
viewers for valuable feedback. This research was
supported by an NSF CAREER Award, the DARPA
TIDES program, the Italian MIUR under project
PRIN No. 2003091149 005, and an equipment gift
from Sun Microsystems.
References
A. Aho and J. Ullman. 1969. Syntax directed translations and
the pushdown assembler. Journal of Computer and System
Sciences, 3:37–56, February.
T. Becker, A. Joshi, and O. Rambow. 1991. Long-distance
scrambling and tree adjoining grammars. In Proceedings of
the 5th Meeting of the European Chapter of the Association
for Computational Linguistics (EACL), Berlin, Germany.
E. Bertsch and M. J. Nederhof. 2001. On the complexity
of some extensions of RCG parsing. In Proceedings of
the 7th International Workshop on Parsing Technologies
(IWPT), pages 66–77, Beijing, China.
M. Dras and T. Bleam. 2000. How problematic are clitics for
S-TAG translation? In Proceedings of the 5th International
Workshop on Tree Adjoining Grammars and Related For-
malisms (TAG+5), Paris, France.
J. Hopcroft, R. Motwani, and J. Ullman. 2001. Introduction to
Automata Theory, Languages and Computation. Addison-Wesley, USA.
I. Dan Melamed, G. Satta, and B. Wellington. 2004. Gener-
alized multitext grammars. Technical Report 04-003, NYU
Proteus Project. http://nlp.cs.nyu.edu/pubs/.
I. Dan Melamed. 2003. Multitext grammars and synchronous
parsers. In Proceedings of the Human Language Technology
Conference and the North American Association for Com-
putational Linguistics (HLT-NAACL), pages 158–165, Ed-
monton, Canada.
I. Dan Melamed. 2004. Statistical machine translation by pars-
ing. In Proceedings of the 42nd Annual Meeting of the As-
sociation for Computational Linguistics (ACL), Barcelona,
Spain.
O. Rambow and G. Satta. 1996. Synchronous models of lan-
guage. In Proceedings of the 34th Annual Meeting of the As-
sociation for Computational Linguistics (ACL), Santa Cruz,
USA.
O. Rambow and G. Satta. 1999. Independent parallelism in
finite copying parallel rewriting systems. Theoretical Com-
puter Science, 223:87–120, July.
O. Rambow. 1995. Formal and Computational Aspects of Nat-
ural Language Syntax. Ph.D. thesis, University of Pennsyl-
vania, Philadelphia, PA.
S. Shieber. 1994. Restricting the weak-generative capacity of
synchronous tree-adjoining grammars. Computational Intelligence,
10(4):371–386.
D. J. Weir. 1988. Characterizing Mildly Context-Sensitive
Grammar Formalisms. Ph.D. thesis, Department of Com-
puter and Information Science, University of Pennsylvania.
D. Wu. 1997. Stochastic inversion transduction grammars and
bilingual parsing of parallel corpora. Computational Lin-
guistics, 23(3):377–404, September.
D. H. Younger. 1967. Recognition and parsing of context-free
languages in time $n^3$. Information and Control, 10(2):189–208,
February.
