Stochastic Inversion Transduction 
Grammars and Bilingual Parsing of 
Parallel Corpora 
Dekai Wu*
Hong Kong University of Science and 
Technology 
We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual 
language modeling of sentence-pairs, and (2) the concept of bilingual parsing with a variety of 
parallel corpus analysis applications. Aside from the bilingual orientation, three major features 
distinguish the formalism from the finite-state transducers more traditionally found in compu- 
tational linguistics: it skips directly to a context-free rather than finite-state base, it permits a 
minimal extra degree of ordering flexibility, and its probabilistic formulation admits an efficient 
maximum-likelihood bilingual parsing algorithm. A convenient normal form is shown to exist. 
Analysis of the formalism's expressiveness suggests that it is particularly well suited to modeling 
ordering shifts between languages, balancing needed flexibility against complexity constraints. 
We discuss a number of examples of how stochastic inversion transduction grammars bring bilin- 
gual constraints to bear upon problematic corpus analysis tasks such as segmentation, bracketing, 
phrasal alignment, and parsing. 
1. Introduction 
We introduce a general formalism for modeling of bilingual sentence pairs, known as 
an inversion transduction grammar, with potential application in a variety of corpus 
analysis areas. Transduction grammar models, especially of the finite-state family, have 
long been known. However, the imposition of identical ordering constraints upon both 
streams severely restricts their applicability, and thus transduction grammars have re- 
ceived relatively little attention in language-modeling research. The inversion trans- 
duction grammar formalism skips directly to a context-free, rather than finite-state, 
base and permits one extra degree of ordering flexibility, while retaining properties 
necessary for efficient computation, thereby sidestepping the limitations of traditional 
transduction grammars. 
In tandem with the concept of bilingual language-modeling, we propose the con- 
cept of bilingual parsing, where the input is a sentence-pair rather than a sentence. 
Though inversion transduction grammars remain inadequate as full-fledged transla- 
tion models, bilingual parsing with simple inversion transduction grammars turns 
out to be very useful for parallel corpus analysis when the true grammar is not fully 
known. Parallel bilingual corpora have been shown to provide a rich source of con- 
straints for statistical analysis (Brown et al. 1990; Gale and Church 1991; Gale, Church, 
and Yarowsky 1992; Church 1993; Brown et al. 1993; Dagan, Church, and Gale 1993; 
Department of Computer Science, University of Science and Technology, Clear Water Bay, Hong Kong. 
E-mail: dekai@cs.ust.hk 
© 1997 Association for Computational Linguistics 
Computational Linguistics Volume 23, Number 3 
Fung and Church 1994; Wu and Xia 1994; Fung and McKeown 1994). The primary 
purpose of bilingual parsing with inversion transduction grammars is not to flag un- 
grammatical inputs; rather, the aim is to extract structure from the input data, which 
is assumed to be grammatical, in keeping with the spirit of robust parsing. The for- 
malism's uniform integration of various types of bracketing and alignment constraints 
is one of its chief strengths. 
The paper is divided into two main parts. We begin in the first part below by 
laying out the basic formalism, then show that reduction to a normal form is possible. 
We then raise several desiderata for the expressiveness of any bilingual language- 
modeling formalism in terms of its constituent-matching flexibility and discuss how 
the characteristics of the inversion transduction formalism are particularly suited to 
address these criteria. Afterwards we introduce a stochastic version and give an al- 
gorithm for finding the optimal bilingual parse of a sentence-pair. The formalism is 
independent of the languages; we give examples and applications using Chinese and 
English because languages from different families provide a more rigorous testing 
ground. In the second part, we survey a number of sample applications and exten- 
sions of bilingual parsing for segmentation, bracketing, phrasal alignment, and other 
parsing tasks. 
2. Inversion Transduction Grammars 
A transduction grammar describes a structurally correlated pair of languages. For 
our purposes, the generative view is most convenient: the grammar generates trans- 
ductions, so that two output streams are simultaneously generated, one for each lan- 
guage. This contrasts with the common input-output view popularized by both syntax- 
directed transduction grammars and finite-state transducers. The generative view is 
more appropriate for our applications because the roles of the two languages are sym- 
metrical, in contrast to the usual applications of syntax-directed transduction gram- 
mars. Moreover, the input-output view works better when a machine for accepting 
one of the languages (the input language) has a high degree of determinism, which is 
not the case here. 
Our transduction model is context-free, rather than finite-state. Finite-state trans- 
ducers, or FSTs, are well known to be useful for specific tasks such as analysis of 
inflectional morphology (Koskenniemi 1983), text-to-speech conversion (Kaplan and 
Kay 1994), and nominal, number, and temporal phrase normalization (Gazdar and 
Mellish 1989). FSTs may also be used to parse restricted classes of context-free gram- 
mars (Pereira 1991; Roche 1994; Laporte 1996). However, the bilingual corpus analysis 
tasks we consider in this paper are quite different from the tasks for which FSTs are 
apparently well suited. Our domain is broader, and the model possesses very little a 
priori specific structural knowledge of the language. 
As a stepping stone to inversion transduction grammars, we first consider what 
a context-free model known as a simple transduction grammar (Lewis and Stearns 
1968) would look like. Simple transduction grammars (as well as inversion transduc- 
tion grammars) are restricted cases of the general class of context-free syntax-directed 
transduction grammars (Aho and Ullman 1969a, 1969b, 1972); however, we will avoid 
the term syntax-directed here, so as to de-emphasize the input-output connotation as 
discussed above. 
A simple transduction grammar can be written by marking every terminal symbol 
for a particular output stream. Thus, each rewrite rule emits not one but two streams. 
For example, a rewrite rule of the form $A \rightarrow B\,x_1\,y_2\,C\,z_1$ means that the terminal symbols
$x$ and $z$ are symbols of the language $L_1$ emitted on stream 1, while $y$ is a symbol of
Wu Bilingual Parsing 
(a) S    → [SP Stop]
    SP   → [NP VP] | [NP VV] | [NP V]
    PP   → [Prep NP]
    NP   → [Det NN] | [Det N] | [Pro] | [NP Conj NP]
    NN   → [A N] | [NN PP]
    VP   → [Aux VP] | [Aux VV] | [VV PP]
    VV   → [V NP] | [Cop A]
    Det  → the/ε
    Prep → to/~
    Pro  → I/~ | you/~
    N    → authority/~ | secretary/~
    A    → accountable/~ | financial/~
    Conj → and/~l\]
    Aux  → will/~
    Cop  → be/ε
    Stop → ./。
(b) VP   → ⟨VV PP⟩
Figure 1
A simple transduction grammar (a) and an inverted-orientation production (b).
the language L2 emitted on stream 2. It follows that every nonterminal stands for a 
class of derivable substring pairs. 
We can use a simple transduction grammar to model the generation of bilingual
sentence pairs. As a mnemonic convention, we usually use the alternative notation
$A \rightarrow B\ x/y\ C\ z/\epsilon$ to associate matching output tokens. Though this additional
information has no formal generative effect, it reminds us that $x/y$ must be a valid entry in the
translation lexicon. We call a matched terminal symbol pair such as $x/y$ a couple. The
null symbol $\epsilon$ means that no output token is generated. We call $x/\epsilon$ an $L_1$-singleton,
and $\epsilon/y$ an $L_2$-singleton.
Consider the simple transduction grammar fragment shown in Figure 1(a). (It will
become apparent below why we explicitly include brackets around right-hand sides 
containing nonterminals, which are usually omitted with standard CFGs.) The simple 
transduction grammar can generate, for instance, the following pair of English and 
Chinese sentences in translation: 
(1) a. [[[[The [Financial Secretary]NN ]NP and [I]NP ]NP [will [be
accountable]VV ]VP ]SP .]S
b. \[\[\[\[\[~ ~----\]\]NN \]NP ~ \[~'~\]NP \]NP \[~ \[~\]VV lVP lSP o \]S 
Notice that each nonterminal derives two substrings, one in each language. The two 
substrings are counterparts of each other. In fact, it is natural to write the parse trees 
together: 
(2) \[\[\[\[The/c \[Financial/l~qC~J( Secretary/~----JlNN \]NP and/~l\] \[I/~:J~\]Np \]NP 
\[will/~@ \[be/c accountable/~t~\]vv \]vP IsP ./o \]s 
Of course, in general, simple transduction grammars are not very useful, precisely 
because they require the two languages to share exactly the same grammatical structure 
(modulo those distinctions that can be handled with lexical singletons). For example, 
the following sentence pair from our corpus cannot be generated: 
(3) a. The Authority will be accountable to the Financial Secretary. 
b. ~t:~ ~--a ~ ~ ~,q~ ~ ~ ~. ~ o 
(Authority will to Financial Secretary accountable.) 
To make transduction grammars truly useful for bilingual tasks, we must escape 
the rigid parallel ordering constraint of simple transduction grammars. At the same 
time, any relaxation of constraints must be traded off against increases in the com- 
putational complexity of parsing, which may easily become exponential. The key is 
to make the relaxation relatively modest but still handle a wide range of ordering 
variations. 
The inversion transduction grammar (ITG) formalism only minimally extends the
generative power of a simple transduction grammar, yet turns out to be surprisingly
effective.¹ Like simple transduction grammars, ITGs remain a subset of context-free
(syntax-directed) transduction grammars (Lewis and Stearns 1968), but this view is too
general to be of much help.² The productions of an inversion transduction grammar
are interpreted just as in a simple transduction grammar, except that two possible
orientations are allowed. Pure simple transduction grammars have the implicit
characteristic that for both output streams, the symbols generated by the right-hand-side
constituents of a production are concatenated in the same left-to-right order. Inversion
transduction grammars also allow such productions, which are said to have straight
orientation. In addition, however, inversion transduction grammars allow productions
with inverted orientation, which generate output for stream 2 by emitting the
constituents on a production's right-hand side in right-to-left order. We indicate a
production's orientation with explicit notation for the two varieties of concatenation operators
on string-pairs. The operator $[\,]$ performs the "usual" pairwise concatenation so that
$[AB]$ yields the string-pair $(C_1, C_2)$ where $C_1 = A_1B_1$ and $C_2 = A_2B_2$. But the operator $\langle\rangle$
concatenates constituents on output stream 1 while reversing them on stream 2, so that
$C_1 = A_1B_1$ but $C_2 = B_2A_2$. Since inversion is permitted at any level of rule expansion,
a derivation may intermix productions of either orientation within the parse tree. For
example, if the inverted-orientation production of Figure 1(b) is added to the earlier
simple transduction grammar, sentence-pair (3) can then be generated as follows:
(4) a. [[[The Authority]NP [will [[be accountable]VV [to [the [[Financial
Secretary]NN ]NNN ]NP ]PP ]VP ]VP ]SP .]S
b. \[\[\[~\]NP \[~ \[\[\[~'\] \[\[\[~ ~---J\]NN \]NNN \]NP \]PP \[~\]VV \]VP \]VP 
\]sp o ls 
We can show the common structure of the two sentences more clearly and compactly
with the aid of the $\langle\rangle$ notation:
1 The expressiveness of simple transduction grammars is equivalent to nondeterministic pushdown
transducers (Savitch 1982).
2 Also keep in mind that ITGs turn out to be especially suited for bilingual parsing applications, whereas
pushdown transducers and syntax-directed transduction grammars are designed for monolingual
parsing (in tandem with generation).
Figure 2
Inversion transduction grammar parse tree.
(5) \[\[\[The/~ Authority/~ \]NP \[will/~@ (\[be/c accountable/~\]vv 
\[to/Fh-J \[the/¢ \[\[Financial/~ Secretary/~lNN \]NNN \]NP \]PP )VP \]vP lsp 
• /o Is 
Alternatively, a graphical parse tree notation is shown in Figure 2, where the $\langle\rangle$ level
of bracketing is indicated by a horizontal line. The English is read in the usual
depth-first left-to-right order, but for the Chinese, a horizontal line means the right subtree
is traversed before the left.
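The two reading orders can be made concrete with a minimal sketch. The `Leaf`, `Node`, and `yields` names are hypothetical, and the uppercase glosses (`AUTHORITY`, `WILL`, and so on) are illustrative stand-ins for the Chinese tokens rather than the paper's actual lexical entries. The tree follows the bracketing of sentence-pair (5), with one inverted node for the VP.

```python
from dataclasses import dataclass
from typing import List, Union

EPS = ""  # stands in for the null symbol: no token is emitted on that stream


@dataclass
class Leaf:
    l1: str  # token emitted on stream 1 (EPS for none)
    l2: str  # token emitted on stream 2 (EPS for none)


@dataclass
class Node:
    children: List[Union["Node", Leaf]]
    inverted: bool = False  # False: straight orientation; True: inverted


def yields(tree):
    """Return (stream-1 tokens, stream-2 tokens) generated by the tree."""
    if isinstance(tree, Leaf):
        return ([tree.l1] if tree.l1 else [], [tree.l2] if tree.l2 else [])
    parts = [yields(child) for child in tree.children]
    stream1 = [tok for part in parts for tok in part[0]]
    # Stream 2 visits the children right-to-left under an inverted node.
    order = reversed(parts) if tree.inverted else parts
    stream2 = [tok for part in order for tok in part[1]]
    return stream1, stream2


tree = Node([
    Node([
        Node([Leaf("The", EPS), Leaf("Authority", "AUTHORITY")]),
        Node([
            Leaf("will", "WILL"),
            Node([
                Node([Leaf("be", EPS), Leaf("accountable", "ACCOUNTABLE")]),
                Node([Leaf("to", "TO"),
                      Node([Leaf("the", EPS),
                            Node([Leaf("Financial", "FINANCIAL"),
                                  Leaf("Secretary", "SECRETARY")])])]),
            ], inverted=True),
        ]),
    ]),
    Leaf(".", "."),
])
english, gloss = yields(tree)
```

Here `gloss` comes out in the order given by the gloss of (3b): Authority, will, to, Financial, Secretary, accountable.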
Parsing, in the case of an ITG, means building matched constituents for input 
sentence-pairs rather than sentences. This means that the adjacency constraints given 
by the nested levels must be obeyed in the bracketings of both languages. The result of 
the parse yields labeled bracketings for both sentences, as well as a bracket alignment 
indicating the parallel constituents between the sentences. The constituent alignment 
includes a word alignment as a by-product. 
The nonterminals may not always look like those of an ordinary CFG. Clearly, the 
nonterminals of an ITG must be chosen in a somewhat different manner than for a 
monolingual grammar, since they must simultaneously account for syntactic patterns 
of both languages. One might even decide to choose nonterminals for an ITG that 
do not match linguistic categories, sacrificing this to the goal of ensuring that all 
corresponding substrings can be aligned. 
Where is the Secretary of Finance when needed ?
II~¢~ ~ ~~ ~ ~ J\]~ ?
Figure 3
An extremely distorted alignment that can be accommodated by an ITG.
An ITG can accommodate a wider range of ordering variation between the languages
than might appear at first blush, through appropriate decomposition of productions
(and thus constituents), in conjunction with introduction of new auxiliary nonterminals
where needed. For instance, even messy alignments such as that in Figure 3
can be handled by interleaving orientations:
(6) \[((Where/JJ\]~ is/T) \[\[the/E (Secretary/~ \[of/( Finance/llq~\])\] 
(when/l~ needed/~'~)\]) ?/?\] 
This bracketing is of course linguistically implausible, so whether such parses are ac- 
ceptable depends on one's objective. Moreover, it may even remain possible to align 
constituents for phenomena whose underlying structure is not context-free--say, ellip- 
sis or coordination--as long as the surface structures of the two languages fortuitously 
parallel each other (though again the bracketing would be linguistically implausible). 
We will return to the subject of ITGs' ordering flexibility in Section 4. 
We stress again that the primary purpose of ITGs is to maximize robustness for 
parallel corpus analysis rather than to verify grammaticality, and therefore writing 
grammars is made much easier since the grammars can be minimal and very leaky. 
We consider elsewhere an extreme special case of leaky ITGs, inversion-invariant 
transduction grammars, in which all productions occur with both orientations (Wu 
1995). As the applications below demonstrate, the bilingual lexical constraints carry 
greater importance than the tightness of the grammar. 
Formally, an inversion transduction grammar, or ITG, is denoted by
$G = (\mathcal{N}, \mathcal{W}_1, \mathcal{W}_2, \mathcal{R}, S)$, where $\mathcal{N}$ is a finite set of nonterminals, $\mathcal{W}_1$ is a finite set of words
(terminals) of language 1, $\mathcal{W}_2$ is a finite set of words (terminals) of language 2, $\mathcal{R}$ is
a finite set of rewrite rules (productions), and $S \in \mathcal{N}$ is the start symbol. The space
of word-pairs (terminal-pairs) $\mathcal{X} = (\mathcal{W}_1 \cup \{\epsilon\}) \times (\mathcal{W}_2 \cup \{\epsilon\})$ contains lexical
translations denoted $x/y$ and singletons denoted $x/\epsilon$ or $\epsilon/y$, where $x \in \mathcal{W}_1$ and $y \in \mathcal{W}_2$. Each
production is either of straight orientation written $A \rightarrow [a_1 a_2 \ldots a_r]$, or of inverted
orientation written $A \rightarrow \langle a_1 a_2 \ldots a_r \rangle$, where $a_i \in \mathcal{N} \cup \mathcal{X}$ and $r$ is the rank of the production.
The set of transductions generated by $G$ is denoted $T(G)$. The sets of (monolingual)
strings generated by $G$ for the first and second output languages are denoted $L_1(G)$
and $L_2(G)$, respectively.
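As an illustration only, the five-tuple can be written down as a small data structure. The class names `Production` and `ITG`, the use of `None` for the null symbol, and the `is_well_formed` check are assumptions made for this sketch and are not part of the formalism itself.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple, Union

EPS = None  # stands in for the null symbol
Pair = Tuple[Optional[str], Optional[str]]  # terminal-pair: x/y, x/eps, or eps/y
Symbol = Union[str, Pair]                   # nonterminal name or terminal-pair


@dataclass(frozen=True)
class Production:
    lhs: str
    rhs: Tuple[Symbol, ...]
    inverted: bool = False  # False: straight orientation; True: inverted

    @property
    def rank(self) -> int:
        return len(self.rhs)


@dataclass(frozen=True)
class ITG:
    nonterminals: FrozenSet[str]
    words1: FrozenSet[str]
    words2: FrozenSet[str]
    rules: Tuple[Production, ...]
    start: str

    def is_well_formed(self) -> bool:
        """Check that every symbol is a known nonterminal or a valid word-pair."""
        def ok(sym: Symbol) -> bool:
            if isinstance(sym, tuple):
                x, y = sym
                return ((x is EPS or x in self.words1)
                        and (y is EPS or y in self.words2))
            return sym in self.nonterminals

        return self.start in self.nonterminals and all(
            p.lhs in self.nonterminals and all(ok(a) for a in p.rhs)
            for p in self.rules)


toy = ITG(
    nonterminals=frozenset({"S", "A"}),
    words1=frozenset({"x"}),
    words2=frozenset({"y"}),
    rules=(Production("S", ("A", "A"), inverted=True),
           Production("A", (("x", "y"),)),
           Production("A", (("x", EPS),))),
    start="S",
)
```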
3. A Normal Form for Inversion Transduction Grammars 
We now show that every ITG can be expressed as an equivalent ITG in a 2-normal form 
that simplifies algorithms and analyses on ITGs. In particular, the parsing algorithm 
of the next section operates on ITGs in normal form. The availability of a 2-normal 
form is a noteworthy characteristic of ITGs; no such normal form is available for 
unrestricted context-free (syntax-directed) transduction grammars (Aho and Ullman 
1969b). The proof closely follows that for standard CFGs, and the proofs of the lemmas 
are omitted. 
Lemma 1
For any inversion transduction grammar $G$, there exists an equivalent inversion
transduction grammar $G'$ where $T(G) = T(G')$, such that:
1. If $\epsilon \in L_1(G)$ and $\epsilon \in L_2(G)$, then $G'$ contains a single production of the
form $S' \rightarrow \epsilon/\epsilon$, where $S'$ is the start symbol of $G'$ and does not appear on
the right-hand side of any production of $G'$;
2. otherwise $G'$ contains no productions of the form $A \rightarrow \epsilon/\epsilon$.
Lemma 2
For any inversion transduction grammar $G$, there exists an equivalent inversion
transduction grammar $G'$ where $T(G) = T(G')$, such that the right-hand side of any
production of $G'$ contains either a single terminal-pair or a list of nonterminals.
Lemma 3
For any inversion transduction grammar $G$, there exists an equivalent inversion
transduction grammar $G'$ where $T(G) = T(G')$, such that $G'$ does not contain any
productions of the form $A \rightarrow B$.
Theorem 1 
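For any inversion transduction grammar $G$, there exists an equivalent inversion
transduction grammar $G'$ in which every production takes one of the following forms:
$$S \rightarrow \epsilon/\epsilon \qquad A \rightarrow x/\epsilon \qquad A \rightarrow [B\,C]$$
$$A \rightarrow x/y \qquad A \rightarrow \epsilon/y \qquad A \rightarrow \langle B\,C \rangle$$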
Proof 
By Lemmas 1, 2, and 3, we may assume $G$ contains only productions of the form
$S \rightarrow \epsilon/\epsilon$, $A \rightarrow x/y$, $A \rightarrow x/\epsilon$, $A \rightarrow \epsilon/y$, $A \rightarrow [B_1 B_2]$, $A \rightarrow \langle B_1 B_2 \rangle$, $A \rightarrow [B_1 \cdots B_n]$, and
$A \rightarrow \langle B_1 \cdots B_n \rangle$ where $n \ge 3$ and $A \ne S$. Include in $G'$ all productions of the first six
types. The remaining two types are transformed as follows:
For each production of the form $A \rightarrow [B_1 \cdots B_n]$ we introduce new nonterminals
$X_1 \ldots X_{n-2}$ in order to replace the production with the set of rules $A \rightarrow [B_1 X_1]$,
$X_1 \rightarrow [B_2 X_2]$, $\ldots$, $X_{n-3} \rightarrow [B_{n-2} X_{n-2}]$, $X_{n-2} \rightarrow [B_{n-1} B_n]$. Let $(e, c)$ be any string-pair
derivable from $A \rightarrow [B_1 \cdots B_n]$, where $e$ is output on stream 1 and $c$ on stream 2. Define
$e^i$ as the substring of $e$ derived from $B_i$, and similarly define $c^i$. Then $X_i$ generates
$(e^{i+1} \cdots e^n, c^{i+1} \cdots c^n)$ for all $1 \le i < n - 1$, so the new production $A \rightarrow [B_1 X_1]$ also
generates $(e, c)$. No additional string-pairs are generated due to the new productions
(since each $X_i$ is only reachable from $X_{i-1}$ and $X_1$ is only reachable from $A$).
For each production of the form $A \rightarrow \langle B_1 \cdots B_n \rangle$ we replace the production with
the set of rules $A \rightarrow \langle B_1 Y_1 \rangle$, $Y_1 \rightarrow \langle B_2 Y_2 \rangle$, $\ldots$, $Y_{n-3} \rightarrow \langle B_{n-2} Y_{n-2} \rangle$, $Y_{n-2} \rightarrow \langle B_{n-1} B_n \rangle$. Let
$(e, c)$ be any string-pair derivable from $A \rightarrow \langle B_1 \cdots B_n \rangle$, where $e$ is output on stream
1 and $c$ on stream 2. Again define $e^i$ and $c^i$ as the substrings derived from $B_i$, but
in this case $(e, c) = (e^1 \cdots e^n, c^n \cdots c^1)$. Then $Y_i$ generates $(e^{i+1} \cdots e^n, c^n \cdots c^{i+1})$ for all
$1 \le i < n - 1$, so the new production $A \rightarrow \langle B_1 Y_1 \rangle$ also generates $(e, c)$. Again, no
additional string-pairs are generated due to the new productions. □
Henceforth all transduction grammars will be assumed to be in normal form. 
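The rank-reduction step in the proof can be sketched in a few lines. The representation of a rule as an `(lhs, rhs, inverted)` tuple, the `binarize` and `generate` helpers, and the fresh-symbol naming scheme are all assumptions made for illustration; only the chaining of new nonterminals follows the construction above.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, Tuple[str, ...], bool]  # (lhs, rhs nonterminals, inverted?)


def binarize(rule: Rule, prefix: str = "X") -> List[Rule]:
    """Replace one rank-n production (n >= 3) by n-1 rank-2 productions."""
    lhs, rhs, inverted = rule
    n = len(rhs)
    if n <= 2:
        return [rule]
    # Fresh nonterminals X_1 .. X_{n-2}; the names are hypothetical.
    fresh = [f"{prefix}_{lhs}_{i}" for i in range(1, n - 1)]
    out = [(lhs, (rhs[0], fresh[0]), inverted)]
    for i in range(1, n - 2):
        out.append((fresh[i - 1], (rhs[i], fresh[i]), inverted))
    out.append((fresh[-1], (rhs[n - 2], rhs[n - 1]), inverted))
    return out


def generate(sym: str, rules: Dict[str, Tuple[Tuple[str, ...], bool]],
             leaves: Dict[str, Tuple[str, str]]) -> Tuple[str, str]:
    """Expand sym into its string-pair, given one rule per left-hand side."""
    if sym in leaves:
        return leaves[sym]
    rhs, inverted = rules[sym]
    parts = [generate(s, rules, leaves) for s in rhs]
    e = "".join(p[0] for p in parts)
    c = "".join(p[1] for p in (reversed(parts) if inverted else parts))
    return e, c


original = ("A", ("B1", "B2", "B3", "B4"), True)
leaves = {f"B{i}": (f"e{i}", f"c{i}") for i in range(1, 5)}
before = generate("A", {"A": (original[1], original[2])}, leaves)
binary = binarize(original)
after = generate("A", {lhs: (rhs, inv) for lhs, rhs, inv in binary}, leaves)
```

In this toy case both `before` and `after` are the same string-pair, with stream 2 reversed as the proof's inverted case requires.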
4. Expressiveness Characteristics 
We now turn to the expressiveness desiderata for a matching formalism. It is of course 
difficult to make precise claims as to what characteristics are necessary and/or suffi- 
cient for such a model, since no cognitive studies that are directly pertinent to bilingual 
constituent alignment are available. Nonetheless, most related previous parallel cor- 
pus analysis models share certain conceptual approaches with ours, loosely based on 
cross-linguistic theories related to constituency, case frames, or thematic roles, as well 
as computational feasibility needs. Below we survey the most common constraints 
and discuss their relation to ITGs. 
Crossing Constraints. Arrangements where the matchings between subtrees cross
one another are prohibited by crossing constraints, unless the subtrees' immediate
parent constituents are also matched to each other. For example, given the constituent 
matchings depicted as solid lines in Figure 4, the dotted-line matchings corresponding 
to potential lexical translations would be ruled illegal. Crossing constraints are im- 
plicit in many phrasal matching approaches, both constituency-oriented (Kaji, Kida, 
and Morimoto 1992; Cranias, Papageorgiou, and Peperidis 1994; Grishman 1994) and 
dependency-oriented (Sadler and Vendelmans 1990; Matsumoto, Ishimoto, and Ut- 
suro 1993). The theoretical cross-linguistic hypothesis here is that the core arguments 
of frames tend to stay together over different languages. The constraint is also useful 
for computational reasons, since it helps avoid exponential bilingual matching times. 
ITGs inherently implement a crossing constraint; in fact, the version enforced by 
ITGs is even stronger. This is because even within a single constituent, immediate 
subtrees are only permitted to cross in exact inverted order. As we shall argue below, 
this restriction reduces matching flexibility in a desirable fashion. 
Rank Constraints. The second expressiveness desideratum for a matching formal- 
ism is to somehow limit the rank of constituents (the number of children or right- 
hand-side symbols), which dictates the span over which matchings may cross. As the 
number of subtrees of an Ll-constituent grows, the number of possible matchings to 
subtrees of the corresponding L2-constituent grows combinatorially, with correspond- 
ing time complexity growth on the matching process. Moreover, if constituents can 
immediately dominate too many tokens of the sentences, the crossing constraint loses 
effectiveness--in the extreme, if a single constituent immediately dominates the en- 
tire sentence-pair, then any permutation is permissible without violating the crossing 
constraint. Thus, we would like to constrain the rank as much as possible, while still 
permitting some reasonable degree of permutation flexibility. 
Recasting this issue in terms of the general class of context-free (syntax-directed) 
transduction grammars, the number of possible subtree matchings for a single con- 
stituent grows combinatorially with the number of symbols on a production's right- 
hand side. However, it turns out that the ITG restriction of allowing only matchings 
with straight or inverted orientation effectively cuts the combinatorial growth, while 
still maintaining flexibility where needed. 
To see how ITGs maintain needed flexibility, consider Figure 5, which shows all 24 
possible complete matchings between two constituents of length four each. Nearly all 
of these--22 out of 24--can be generated by an ITG, as shown by the parse trees (whose 
The Security Bureau granted authority to the police station
Figure 4
The crossing constraint.
nonterminal labels are omitted).³ The 22 permitted matchings are representative of real
transpositions in word order between the English-Chinese sentences in our data. The 
only two matchings that cannot be generated are very distorted transpositions that we 
might call "inside-out" matchings. We have been unable to find real examples in our 
data of constituent arguments undergoing "inside-out" transposition. 
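The count of 22 out of 24 can be checked by brute force. The sketch below assumes a complete matching is represented as a permutation, where `perm[i]` is the stream-2 position matched to stream-1 position `i`, and treats a matching as ITG-generable when it can be split recursively into two adjacent blocks in straight or inverted order. The function name is hypothetical.

```python
from functools import lru_cache
from itertools import permutations


def itg_generable(perm):
    """True if perm can be built from straight and inverted binary splits."""
    perm = tuple(perm)

    @lru_cache(maxsize=None)
    def ok(lo, hi):
        if hi - lo <= 1:
            return True
        for mid in range(lo + 1, hi):
            left, right = perm[lo:mid], perm[mid:hi]
            # Straight: the left block precedes the right block on stream 2.
            # Inverted: the left block follows the right block on stream 2.
            if max(left) < min(right) or min(left) > max(right):
                if ok(lo, mid) and ok(mid, hi):
                    return True
        return False

    return ok(0, len(perm))


counts = [sum(itg_generable(p) for p in permutations(range(r)))
          for r in range(1, 7)]
inside_out = [p for p in permutations(range(4)) if not itg_generable(p)]
```

For length four, the two permutations left in `inside_out` are the "inside-out" matchings; in 1-based notation they read 2413 and 3142.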
Note that this hypothesis is for fixed-word-order languages that are lightly in- 
flected, such as English and Chinese. It would not be expected to hold for so-called 
scrambling or free-word-order languages, or heavily inflected languages. However, 
inflections provide alternative surface cues for determining constituent roles (and 
3 As discussed later, in many cases more than one parse tree can generate the same subconstituent 
matching. The trees shown are the canonical parses, as generated by the grammar of Figure 10. 
 r            ITG       all matchings  ratio
 0              1                   1  1.000
 1              1                   1  1.000
 2              2                   2  1.000
 3              6                   6  1.000
 4             22                  24  0.917
 5             90                 120  0.750
 6            394                 720  0.547
 7          1,806               5,040  0.358
 8          8,558              40,320  0.212
 9         41,586             362,880  0.115
10        206,098           3,628,800  0.057
11      1,037,718          39,916,800  0.026
12      5,293,446         479,001,600  0.011
13     27,297,738       6,227,020,800  0.004
14    142,078,746      87,178,291,200  0.002
15    745,387,038   1,307,674,368,000  0.001
16  3,937,603,038  20,922,789,888,000  0.000
Figure 6
Growth in number of legal complete subconstituent matchings for context-free (syntax-directed)
transduction grammars with rank r, versus ITGs on a pair of subconstituent sequences of
length r each.
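The columns of Figure 6 can be regenerated without enumeration. The sketch below rests on an observation that is not stated in the text: the printed ITG counts for r ≥ 1 coincide with the large Schröder numbers, which satisfy the recurrence $(n+1)S_n = 3(2n-1)S_{n-1} - (n-2)S_{n-2}$ with $S_0 = 1$ and $S_1 = 2$. The "all matchings" column is simply $r!$. The function name is hypothetical.

```python
from math import factorial


def itg_complete_matchings(max_r):
    """Return counts[r] for r = 0..max_r via the large Schroeder recurrence."""
    s = [1, 2]  # S_0, S_1
    for n in range(2, max_r):
        s.append((3 * (2 * n - 1) * s[n - 1] - (n - 2) * s[n - 2]) // (n + 1))
    # r = 0 has a single (empty) matching; for r >= 1 the count is S_{r-1}.
    return [1] + s[:max_r]


counts = itg_complete_matchings(16)
table = [(r, counts[r], factorial(r), round(counts[r] / factorial(r), 3))
         for r in range(17)]
```

Each tuple in `table` is one row of the figure: rank, ITG count, total count, and their ratio.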
5. Stochastic Inversion Transduction Grammars 
In a stochastic ITG (SITG), a probability is associated with each rewrite rule. Following
the standard convention, we use $a$ and $b$ to denote probabilities for syntactic and
lexical rules, respectively. For example, the probability of the rule $NN \xrightarrow{0.4} [A\ N]$ is
$a_{NN \rightarrow [A\ N]} = 0.4$. The probability of a lexical rule $A \xrightarrow{0.001} x/y$ is $b_A(x, y) = 0.001$. Let
$W_1$, $W_2$ be the vocabulary sizes of the two languages, and $\mathcal{N} = \{A_1, \ldots, A_N\}$ be the
set of nonterminals with indices $1, \ldots, N$. (For conciseness, we sometimes abuse the
notation by writing an index when we mean the corresponding nonterminal symbol,
as long as this introduces no confusion.) Then for every $1 \le i \le N$, the production
probabilities are subject to the constraint that
$$\sum_{1 \le j,k \le N} \left( a_{i \rightarrow [jk]} + a_{i \rightarrow \langle jk \rangle} \right) + \sum_{\substack{1 \le x \le W_1 \\ 1 \le y \le W_2}} b_i(x, y) = 1$$
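A small sketch of how this constraint might be checked for a concrete parameter table. The dictionary layout and the function name `check_normalized` are assumptions; apart from the 0.4 taken from the example above, the probabilities are made up for illustration.

```python
import math
from collections import defaultdict


def check_normalized(syntactic, lexical, tol=1e-9):
    """Report, per nonterminal i, whether its rule probabilities sum to 1.

    syntactic: {(i, orientation, j, k): probability}, orientation "[]" or "<>"
    lexical:   {(i, x, y): probability}
    """
    totals = defaultdict(float)
    for (i, _orientation, _j, _k), p in syntactic.items():
        totals[i] += p
    for (i, _x, _y), p in lexical.items():
        totals[i] += p
    return {i: math.isclose(t, 1.0, abs_tol=tol) for i, t in totals.items()}


syntactic = {
    ("NN", "[]", "A", "N"): 0.4,    # value from the example in the text
    ("NN", "[]", "NN", "PP"): 0.6,  # hypothetical
    ("VP", "<>", "VV", "PP"): 0.5,  # hypothetical, deliberately incomplete
}
lexical = {
    ("A", "x1", "y1"): 0.25,        # hypothetical word-pairs
    ("A", "x2", "y2"): 0.75,
}
report = check_normalized(syntactic, lexical)
```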
We now introduce an algorithm for parsing with stochastic ITGs that computes 
an optimal parse given a sentence-pair using dynamic programming. In bilingual 
parsing, just as with ordinary monolingual parsing, probabilizing the grammar permits
ambiguities to be resolved by choosing the maximum-likelihood parse. Our algorithm 
is similar in spirit to the recognition algorithm for HMMs (Viterbi 1967) and to CYK 
parsing (Kasami 1965; Younger 1967). 
Let the input English sentence be $e_1, \ldots, e_T$ and the corresponding input Chinese
sentence be $c_1, \ldots, c_V$. As an abbreviation we write $e_{s..t}$ for the sequence of words
$e_{s+1}, e_{s+2}, \ldots, e_t$, and similarly for $c_{u..v}$; also, $e_{s..s} = \epsilon$ is the empty string. It is convenient
to use a 4-tuple of the form $q = (s, t, u, v)$ to identify each node of the parse tree, where
 r                 ITG          all matchings  ratio
 0                   1                      1  1.000
 1                   2                      2  1.000
 2                   7                      7  1.000
 3                  34                     34  1.000
 4                 207                    209  0.990
 5               1,466                  1,546  0.948
 6              11,471                 13,327  0.861
 7              96,034                130,922  0.734
 8             843,527              1,441,729  0.585
 9           7,678,546             17,572,114  0.437
10          71,852,559            234,662,231  0.306
11         687,310,394          3,405,357,682  0.202
12       6,693,544,171         53,334,454,417  0.126
13      66,167,433,658        896,324,308,634  0.074
14     662,393,189,919     16,083,557,845,279  0.041
15   6,703,261,197,506    306,827,170,866,106  0.022
16  68,474,445,473,303  6,199,668,952,527,617  0.011
Figure 7
Growth in number of all legal subconstituent matchings (complete or partial, meaning that
some subconstituents are permitted to remain unmatched as singletons) for context-free
(syntax-directed) transduction grammars with rank r, versus ITGs on a pair of subconstituent
sequences of length r each.
the substrings $e_{s..t}$ and $c_{u..v}$ both derive from the node $q$. Denote the nonterminal label
on $q$ by $\ell(q)$. Then for any node $q = (s, t, u, v)$, define
$$\delta_q(i) = \delta_{stuv}(i) = \max_{\text{subtrees of } q} P[\text{subtree of } q,\ \ell(q) = i,\ i \stackrel{*}{\Rightarrow} e_{s..t}/c_{u..v}]$$
as the maximum probability of any derivation from $i$ that successfully parses both $e_{s..t}$
and $c_{u..v}$. Then the best parse of the sentence pair has probability $\delta_{0,T,0,V}(S)$.
The algorithm computes $\delta_{0,T,0,V}(S)$ using the following recurrences. Note that
we generalize argmax to the case where maximization ranges over multiple indices,
by making it vector-valued. Also note that $[\,]$ and $\langle\rangle$ are simply constants, written
mnemonically. The condition $(S - s)(t - S) + (U - u)(v - U) \ne 0$ is a way to specify
that the substring in one, but not both, languages may be split into an empty string $\epsilon$
and the substring itself; this ensures that the recursion terminates, but permits words
that have no match in the other language to map to an $\epsilon$ instead.
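1. Initialization
$$\delta_{t-1,t,v-1,v}(i) = b_i(e_t/c_v), \qquad 1 \le t \le T,\ 1 \le v \le V \tag{1}$$
$$\delta_{t-1,t,v,v}(i) = b_i(e_t/\epsilon), \qquad 1 \le t \le T,\ 0 \le v \le V \tag{2}$$
$$\delta_{t,t,v-1,v}(i) = b_i(\epsilon/c_v), \qquad 0 \le t \le T,\ 1 \le v \le V \tag{3}$$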
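2. Recursion
For all $i, s, t, u, v$ such that $1 \le i \le N$, $0 \le s < t \le T$, $0 \le u < v \le V$, and $t - s + v - u > 2$:
$$\delta_{stuv}(i) = \max\left[\delta^{[]}_{stuv}(i),\ \delta^{\langle\rangle}_{stuv}(i)\right] \tag{4}$$
$$\theta_{stuv}(i) = \begin{cases} [\,] & \text{if } \delta^{[]}_{stuv}(i) > \delta^{\langle\rangle}_{stuv}(i) \\ \langle\rangle & \text{otherwise} \end{cases} \tag{5}$$
where
$$\delta^{[]}_{stuv}(i) = \max_{\substack{1 \le j \le N,\ 1 \le k \le N \\ s \le S \le t,\ u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} a_{i \rightarrow [jk]}\ \delta_{sSuU}(j)\ \delta_{StUv}(k) \tag{6}$$
$$\begin{bmatrix} \iota^{[]}_{stuv}(i) \\ \kappa^{[]}_{stuv}(i) \\ \sigma^{[]}_{stuv}(i) \\ \upsilon^{[]}_{stuv}(i) \end{bmatrix} = \operatorname*{argmax}_{\substack{1 \le j \le N,\ 1 \le k \le N \\ s \le S \le t,\ u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} a_{i \rightarrow [jk]}\ \delta_{sSuU}(j)\ \delta_{StUv}(k) \tag{7}$$
$$\delta^{\langle\rangle}_{stuv}(i) = \max_{\substack{1 \le j \le N,\ 1 \le k \le N \\ s \le S \le t,\ u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} a_{i \rightarrow \langle jk \rangle}\ \delta_{sSUv}(j)\ \delta_{StuU}(k) \tag{8}$$
$$\begin{bmatrix} \iota^{\langle\rangle}_{stuv}(i) \\ \kappa^{\langle\rangle}_{stuv}(i) \\ \sigma^{\langle\rangle}_{stuv}(i) \\ \upsilon^{\langle\rangle}_{stuv}(i) \end{bmatrix} = \operatorname*{argmax}_{\substack{1 \le j \le N,\ 1 \le k \le N \\ s \le S \le t,\ u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} a_{i \rightarrow \langle jk \rangle}\ \delta_{sSUv}(j)\ \delta_{StuU}(k) \tag{9}$$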
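3. Reconstruction
Initialize by setting the root of the parse tree to $q_1 = (0, T, 0, V)$ and its
nonterminal label to $\ell(q_1) = S$. The remaining descendants in the optimal
parse tree are then given recursively for any $q = (s, t, u, v)$ by:
$$\mathrm{LEFT}(q) = \begin{cases} \mathrm{NIL} & \text{if } t - s + v - u \le 2 \\ (s,\ \sigma^{[]}_q(\ell(q)),\ u,\ \upsilon^{[]}_q(\ell(q))) & \text{if } \theta_q(\ell(q)) = [\,] \text{ and } t - s + v - u > 2 \\ (s,\ \sigma^{\langle\rangle}_q(\ell(q)),\ \upsilon^{\langle\rangle}_q(\ell(q)),\ v) & \text{if } \theta_q(\ell(q)) = \langle\rangle \text{ and } t - s + v - u > 2 \end{cases} \tag{10}$$
$$\mathrm{RIGHT}(q) = \begin{cases} \mathrm{NIL} & \text{if } t - s + v - u \le 2 \\ (\sigma^{[]}_q(\ell(q)),\ t,\ \upsilon^{[]}_q(\ell(q)),\ v) & \text{if } \theta_q(\ell(q)) = [\,] \text{ and } t - s + v - u > 2 \\ (\sigma^{\langle\rangle}_q(\ell(q)),\ t,\ u,\ \upsilon^{\langle\rangle}_q(\ell(q))) & \text{if } \theta_q(\ell(q)) = \langle\rangle \text{ and } t - s + v - u > 2 \end{cases} \tag{11}$$
$$\ell(\mathrm{LEFT}(q)) = \iota^{\theta_q(\ell(q))}_q(\ell(q)) \tag{12}$$
$$\ell(\mathrm{RIGHT}(q)) = \kappa^{\theta_q(\ell(q))}_q(\ell(q)) \tag{13}$$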
The time complexity of this algorithm in the general case is $O(N^3 T^3 V^3)$, where
$N$ is the number of distinct nonterminals and $T$ and $V$ are the lengths of the two
sentences. This is a factor of $V^3$ more than monolingual chart parsing, but has turned
out to remain quite practical for corpus analysis, where parsing need not be real-time.
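The recurrences can be sketched as a short dynamic program. This is a minimal illustration under stated assumptions, not the author's implementation: rules are held in plain dictionaries, `None` plays the role of the null symbol, ties are broken by search order rather than by equation (5), and the function name `sitg_viterbi` is hypothetical. It returns the best probability together with backpointers from which a tree could be read off as in the reconstruction step.

```python
def sitg_viterbi(e, c, a_straight, a_inverted, b, start):
    """Return (best probability, backpointers) for the sentence pair (e, c).

    a_straight, a_inverted: {(i, j, k): prob} for i -> [j k] and i -> <j k>
    b: {(i, x, y): prob} for lexical rules; None stands for the null symbol
    """
    T, V = len(e), len(c)
    delta, back = {}, {}

    def d(s, t, u, v, i):
        return delta.get((s, t, u, v, i), 0.0)

    # 1. Initialization: couples and singletons, equations (1)-(3).
    for (i, x, y), p in b.items():
        for t in range(T + 1):
            for v in range(V + 1):
                if x is not None and y is not None:
                    if t > 0 and v > 0 and e[t - 1] == x and c[v - 1] == y:
                        delta[t - 1, t, v - 1, v, i] = p
                elif x is not None:
                    if t > 0 and e[t - 1] == x:
                        delta[t - 1, t, v, v, i] = p
                elif y is not None:
                    if v > 0 and c[v - 1] == y:
                        delta[t, t, v - 1, v, i] = p

    # 2. Recursion over spans of increasing total length t - s + v - u > 2.
    for total in range(3, T + V + 1):
        for s in range(T):
            for t in range(s + 1, T + 1):
                width = total - (t - s)  # v - u, which must be at least 1
                if width < 1:
                    continue
                for u in range(V - width + 1):
                    v = u + width
                    for S in range(s, t + 1):
                        for U in range(u, v + 1):
                            if (S - s) * (t - S) + (U - u) * (v - U) == 0:
                                continue
                            for (i, j, k), p in a_straight.items():
                                cand = p * d(s, S, u, U, j) * d(S, t, U, v, k)
                                if cand > d(s, t, u, v, i):
                                    delta[s, t, u, v, i] = cand
                                    back[s, t, u, v, i] = ("[]", j, k, S, U)
                            for (i, j, k), p in a_inverted.items():
                                cand = p * d(s, S, U, v, j) * d(S, t, u, U, k)
                                if cand > d(s, t, u, v, i):
                                    delta[s, t, u, v, i] = cand
                                    back[s, t, u, v, i] = ("<>", j, k, S, U)
    return d(0, T, 0, V, start), back


# Toy single-nonterminal grammar with hypothetical probabilities.
rules = {("A", "A", "A"): 0.3}
lexicon = {("A", "a", "X"): 0.2, ("A", "b", "Y"): 0.2}
p_inv, back_inv = sitg_viterbi(["a", "b"], ["Y", "X"], rules, rules, lexicon, "A")
p_str, back_str = sitg_viterbi(["a", "b"], ["X", "Y"], rules, rules, lexicon, "A")
```

The innermost loops visit every rule for every split point of every span, which is where the $N^3 T^3 V^3$ factor in the bound comes from.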
6. Translation-driven Segmentation 
Segmentation of the input sentences is an important step in preparing bilingual cor- 
pora for various learning procedures. Different languages realize the same concept 
using varying numbers of words; for example, a single English word may surface as 
a compound in French. This complicates the problem of matching the words between 
a sentence-pair, since it means that compounds or collocations must sometimes be 
treated as lexical units. The translation lexicon is assumed to contain collocation trans- 
lations to facilitate such multiword matchings. However, the input sentences do not 
come broken into appropriately matching chunks, so it is up to the parser to decide 
when to break up potential collocations into individual words. 
The problem is particularly acute for English and Chinese because word bound- 
aries are not orthographically marked in Chinese text, so not even a default chunking 
exists upon which word matchings could be postulated. (Sentences (2) and (5) demon- 
strate why the obvious trick of taking single characters as words is not a workable 
strategy.) The usual Chinese NLP architecture first preprocesses input text through a
word segmentation module (Chiang et al. 1992; Lin, Chiang, and Su 1992, 1993; Chang
and Chen 1993; Wu and Tseng 1993; Sproat et al. 1994; Wu and Fung 1994). Clearly, however,
bilingual parsing will be hampered by any errors arising from segmentation ambiguities
that could not be resolved in the isolated monolingual context: even if the
Chinese segmentation is acceptable monolingually, it may not agree with the words
present in the English sentence. Matters are made still worse by unpredictable
omissions in the translation lexicon, even for valid compounds.
We therefore extend the algorithm to optimize the Chinese sentence segmentation
in conjunction with the bracketing process. Note that the notion of a Chinese "word"
is a longstanding linguistic question, which our present notion of segmentation does
not address. We adhere here to a purely task-driven definition of what a correct
"segmentation" is, namely that longer segments are desirable only when no compositional
translation is possible. The algorithm is modified to include the following
computations, and remains the same otherwise:
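1. Initialization

$$
\delta^{0}_{stuv}(i) = b_i(e_{s..t}/c_{u..v}), \qquad 0 \le s \le t \le T, \quad 0 \le u \le v \le V \tag{14}
$$

2. Recursion

$$
\delta_{stuv}(i) = \max\bigl[\delta^{[\,]}_{stuv}(i),\ \delta^{\langle\rangle}_{stuv}(i),\ \delta^{0}_{stuv}(i)\bigr] \tag{15}
$$

$$
\theta_{stuv}(i) =
\begin{cases}
[\,] & \text{if } \delta^{[\,]}_{stuv}(i) > \delta^{\langle\rangle}_{stuv}(i) \text{ and } \delta^{[\,]}_{stuv}(i) > \delta^{0}_{stuv}(i) \\
\langle\rangle & \text{if } \delta^{\langle\rangle}_{stuv}(i) > \delta^{[\,]}_{stuv}(i) \text{ and } \delta^{\langle\rangle}_{stuv}(i) > \delta^{0}_{stuv}(i) \\
0 & \text{otherwise}
\end{cases}
\tag{16}
$$

3. Reconstruction

$$
\mathrm{LEFT}(q) =
\begin{cases}
\mathrm{NIL} & \text{if } t-s+v-u \le 2 \\
\bigl(s,\ \sigma_q^{[\,]}(\ell(q)),\ u,\ \upsilon_q^{[\,]}(\ell(q))\bigr) & \text{if } \theta_q(\ell(q)) = [\,] \text{ and } t-s+v-u > 2 \\
\bigl(s,\ \sigma_q^{\langle\rangle}(\ell(q)),\ \upsilon_q^{\langle\rangle}(\ell(q)),\ v\bigr) & \text{if } \theta_q(\ell(q)) = \langle\rangle \text{ and } t-s+v-u > 2 \\
\mathrm{NIL} & \text{otherwise}
\end{cases}
\tag{17}
$$

$$
\mathrm{RIGHT}(q) =
\begin{cases}
\mathrm{NIL} & \text{if } t-s+v-u \le 2 \\
\bigl(\sigma_q^{[\,]}(\ell(q)),\ t,\ \upsilon_q^{[\,]}(\ell(q)),\ v\bigr) & \text{if } \theta_q(\ell(q)) = [\,] \text{ and } t-s+v-u > 2 \\
\bigl(\sigma_q^{\langle\rangle}(\ell(q)),\ t,\ u,\ \upsilon_q^{\langle\rangle}(\ell(q))\bigr) & \text{if } \theta_q(\ell(q)) = \langle\rangle \text{ and } t-s+v-u > 2 \\
\mathrm{NIL} & \text{otherwise}
\end{cases}
\tag{18}
$$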
In our experience, this method has proven extremely effective for avoiding misseg- 
mentation pitfalls, essentially erring only in pathological cases involving coordination 
constructions or lexicon coverage inadequacies. The method is also straightforward to 
employ in tandem with other applications, such as those below. 
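The initialization step (14) can be illustrated with a minimal sketch. It assumes a hypothetical lexicon stored as a Python dictionary keyed by (English segment, Chinese segment) pairs, with the empty string standing for the empty side of a singleton; the function name `init_lexical` and the toy entries are illustrative only and are not taken from our implementation.

```python
def init_lexical(english, chinese, lexicon):
    """Return {(s, t, u, v): prob} for every span pair listed in the lexicon.

    english: list of English words; chinese: list of Chinese characters.
    lexicon: dict mapping (english_segment, chinese_segment) -> probability
             (a hypothetical representation chosen for this sketch).
    """
    T, V = len(english), len(chinese)
    delta0 = {}
    for s in range(T + 1):
        for t in range(s, T + 1):
            e_seg = " ".join(english[s:t])       # "" when s == t
            for u in range(V + 1):
                for v in range(u, V + 1):
                    if s == t and u == v:
                        continue                 # skip the doubly empty span
                    c_seg = "".join(chinese[u:v])
                    p = lexicon.get((e_seg, c_seg))
                    if p:
                        delta0[(s, t, u, v)] = p
    return delta0


# Toy input: one collocation entry and one English singleton entry.
toy_lexicon = {("hong kong", "香港"): 0.9, ("hong", ""): 1e-4}
delta0 = init_lexical(["hong", "kong"], ["香", "港"], toy_lexicon)
```

Because every contiguous character span is a candidate, no prior segmentation of the Chinese side is needed; the recursion (15) then decides whether a long segment or a composition of shorter ones scores higher.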
7. Bracketing 
Bracketing is another intermediate corpus annotation, useful especially when a full- 
coverage grammar with which to parse a corpus is unavailable (for Chinese, an even 
more common situation than with English). Aside from purely linguistic interest, 
bracket structure has been empirically shown to be highly effective at constraining sub- 
sequent training of, for example, stochastic context-free grammars (Pereira and Schabes 
1992; Black, Garside, and Leech 1993). Previous algorithms for automatic bracketing 
operate on monolingual texts and hence require more grammatical constraints; for ex- 
ample, tactics employing mutual information have been applied to tagged text (Mager- 
man and Marcus 1990). 
Our method based on SITGs operates on the novel principle that lexical correspondences
between parallel sentences yield information from which partial bracketings
for both sentences can be extracted. The assumption that no grammar is available
means that constituent categories are not differentiated. Instead, a generic bracket- 
ing transduction grammar is employed, containing only one nonterminal symbol, A, 
which rewrites either recursively as a pair of A's or as a single terminal-pair: 
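$$
\begin{array}{ll}
A \xrightarrow{\;a\;} [A\ A] & \\
A \xrightarrow{\;a\;} \langle A\ A \rangle & \\
A \xrightarrow{\;b_{ij}\;} u_i/v_j & \text{for all } i,j \text{ English-Chinese lexical translations} \\
A \xrightarrow{\;b_{i\epsilon}\;} u_i/\epsilon & \text{for all } i \text{ English vocabulary} \\
A \xrightarrow{\;b_{\epsilon j}\;} \epsilon/v_j & \text{for all } j \text{ Chinese vocabulary}
\end{array}
$$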
Longer productions with rank > 2 are not needed; we show in the subsections below 
that this minimal transduction grammar in normal form is generatively equivalent 
to any reasonable bracketing transduction grammar. Moreover, we also show how 
postprocessing using rotation and flattening operations restores the rank flexibility so 
that an output bracketing can hold more than two immediate constituents, as shown 
in Figure 11. 
The b_ij distribution actually encodes the English-Chinese translation lexicon with
degrees of probability on each potential word translation. We have been using a lexicon 
that was automatically learned from the HKUST English-Chinese Parallel Bilingual 
Corpus via statistical sentence alignment (Wu 1994) and statistical Chinese word and 
collocation extraction (Fung and Wu 1994; Wu and Fung 1994), followed by an EM 
word-translation-learning procedure (Wu and Xia 1994). The latter stage gives us the 
b_ij probabilities directly. For the two singleton productions, which permit any word in
either sentence to be unmatched, a small ε-constant can be chosen for the probabilities
b_iε and b_εj, so that the optimal bracketing resorts to these productions only when it is
otherwise impossible to match the singletons. The parameter a here is of no practical
effect, and is chosen to be very small relative to the b_ij probabilities of lexical translation
pairs. The result is that the maximum-likelihood parser selects the parse tree that best
meets the combined lexical translation preferences, as expressed by the b_ij probabilities.
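As an illustration of how such a parser behaves, the following is a minimal sketch of maximum-likelihood parsing under the one-nonterminal bracketing grammar. It is not our implementation: it uses memoized top-down recursion rather than a bottom-up table, handles word-level leaves only (no collocations), and the names `btg_parse`, `lex`, `a`, and `eps` as well as the default values are assumptions made for the example.

```python
import math
from functools import lru_cache


def btg_parse(e, c, lex, a=0.4, eps=1e-6):
    """Return (log-probability, tree) of the best bracketing of e against c.

    lex maps (english_word, chinese_word) -> b_ij; singletons score eps.
    Trees are leaves (e_word, c_word), with None on an unmatched side,
    or triples ("[]" | "<>", left_subtree, right_subtree).
    """
    log_a, log_eps = math.log(a), math.log(eps)

    def leaf(s, t, u, v):
        shape = (t - s, v - u)
        if shape == (1, 1):
            p = lex.get((e[s], c[u]), 0.0)
            return (math.log(p), (e[s], c[u])) if p > 0 else (-math.inf, None)
        if shape == (1, 0):
            return log_eps, (e[s], None)
        if shape == (0, 1):
            return log_eps, (None, c[u])
        return -math.inf, None

    @lru_cache(maxsize=None)
    def best(s, t, u, v):
        top = leaf(s, t, u, v)
        for S in range(s, t + 1):
            for U in range(u, v + 1):
                # Straight: left child e[s:S]/c[u:U], right child e[S:t]/c[U:v].
                if (S - s) + (U - u) > 0 and (t - S) + (v - U) > 0:
                    (pl, tl), (pr, tr) = best(s, S, u, U), best(S, t, U, v)
                    if log_a + pl + pr > top[0]:
                        top = (log_a + pl + pr, ("[]", tl, tr))
                # Inverted: left child e[s:S]/c[U:v], right child e[S:t]/c[u:U].
                if (S - s) + (v - U) > 0 and (t - S) + (U - u) > 0:
                    (pl, tl), (pr, tr) = best(s, S, U, v), best(S, t, u, U)
                    if log_a + pl + pr > top[0]:
                        top = (log_a + pl + pr, ("<>", tl, tr))
        return top

    return best(0, len(e), 0, len(c))


# Toy symbols: "b" and "c" appear in swapped order on the second stream.
toy_lex = {("a", "A"): 0.9, ("b", "B"): 0.9, ("c", "C"): 0.9}
score, tree = btg_parse(["a", "b", "c"], ["A", "C", "B"], toy_lex)
```

On the toy input the only derivation that matches all three couples is a straight bracket whose right child is an inverted bracket over b and c, so the lexical preferences alone determine the bracketing.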
Pre-/postpositional biases. Many bracketing errors are caused by singletons. With 
singletons, there is no cross-lingual discrimination to increase the certainty between 
alternative bracketings. A heuristic to deal with this is to specify for each of the two 
languages whether prepositions or postpositions are more common, where "preposi- 
tion" here is meant not in the usual part-of-speech sense, but rather in a broad sense 
of the tendency of function words to attach left or right. This simple stratagem is
effective because the majority of unmatched singletons are function words that lack 
counterparts in the other language. This observation holds assuming that the transla- 
tion lexicon's coverage is reasonably good. For both English and Chinese, we specify 
a prepositional bias, which means that singletons are attached to the right whenever 
possible. 
A Singleton-Rebalancing Algorithm. We give here an algorithm for further improv- 
ing the bracketing accuracy in cases of singletons. Consider the following bracketing 
produced by the algorithm of the previous section: 
(7) [[The/ε [Authority/… [will/… ⟨[be/ε accountable/…] [to the/ε
[ε/… [Financial/… Secretary/…]]]⟩]]] ./。]
The prepositional bias has already correctly restricted the singleton The/ε to attach to
the right, but of course The does not belong outside the rest of the sentence, but rather 
with Authority. The problem is that singletons have no discriminative power between 
alternative bracket matchings--they only contribute to the ambiguity. We can minimize 
the impact by moving singletons as deep as possible, closer to the individual word 
they precede or succeed; or in other words, we can widen the scope of the brackets 
immediately following the singleton. In general this improves precision since wide- 
scope brackets are less constraining. 
The algorithm employs a rebalancing strategy reminiscent of balanced tree struc- 
tures using left and right rotations. A left rotation changes a (A(BC)) structure to a 
((AB)C) structure, and vice versa for a right rotation. The task is complicated by the 
presence of both [] and ⟨⟩ brackets with both L1- and L2-singletons, since each
combination presents different interactions. To be legal, a rotation must preserve symbol
order on both output streams. However, the following lemma shows that any subtree 
can always be rebalanced at its root if either of its children is a singleton of either 
language. 
Lemma 4
Let x be an L1-singleton, y be an L2-singleton, and A, B, C be arbitrary terminal or
nonterminal symbols. Then the following properties hold for the [] and ⟨⟩ operators,
where the = relation means that the same two output strings are generated, and the
matching of the symbols is preserved:
(Associativity)
[A[BC]] = [[AB]C]
⟨A⟨BC⟩⟩ = ⟨⟨AB⟩C⟩
SINK-SINGLETON(node)
    if node is not a leaf
        if a rotation property applies at node
            apply the rotation to node
            child ← the child into which the singleton was rotated
            SINK-SINGLETON(child)

REBALANCE-TREE(node)
    if node is not a leaf
        REBALANCE-TREE(left-child[node])
        REBALANCE-TREE(right-child[node])
        SINK-SINGLETON(node)

Figure 8
The singleton rebalancing schema.
(L1-singleton bidirectionality)
[Ax] = ⟨Ax⟩
[xA] = ⟨xA⟩
(L2-singleton flipping commutativity)
[Ay] = ⟨yA⟩
[yA] = ⟨Ay⟩
(L1-singleton rotation properties)
[x⟨AB⟩] = ⟨x⟨AB⟩⟩ = ⟨⟨xA⟩B⟩ = ⟨[xA]B⟩
⟨x[AB]⟩ = [x[AB]] = [[xA]B] = [⟨xA⟩B]
[⟨AB⟩x] = ⟨⟨AB⟩x⟩ = ⟨A⟨Bx⟩⟩ = ⟨A[Bx]⟩
⟨[AB]x⟩ = [[AB]x] = [A[Bx]] = [A⟨Bx⟩]
(L2-singleton rotation properties)
[y⟨AB⟩] = ⟨⟨AB⟩y⟩ = ⟨A⟨By⟩⟩ = ⟨A[yB]⟩
⟨y[AB]⟩ = [[AB]y] = [A[By]] = [A⟨yB⟩]
[⟨AB⟩y] = ⟨y⟨AB⟩⟩ = ⟨⟨yA⟩B⟩ = ⟨[Ay]B⟩
⟨[AB]y⟩ = [y[AB]] = [[yA]B] = [⟨Ay⟩B]
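The properties of Lemma 4 can be checked mechanically by computing the two output strings of each bracketing. The sketch below is an illustration only: trees are nested tuples, the helper names (`yields`, `straight`, `inverted`, `lemma_holds`) are hypothetical, and each non-singleton symbol A is given the placeholder yields A1 and A2 on the two streams so that equal strings also imply equal matchings.

```python
def yields(tree):
    """Return the (L1, L2) output strings of a tree as tuples of symbols."""
    if isinstance(tree, str):
        if tree == "x":                      # L1-singleton: nothing on stream 2
            return ("x",), ()
        if tree == "y":                      # L2-singleton: nothing on stream 1
            return (), ("y",)
        return (tree + "1",), (tree + "2",)  # matched couple
    op, left, right = tree
    l1, l2 = yields(left)
    r1, r2 = yields(right)
    if op == "[]":                           # straight: same order on both streams
        return l1 + r1, l2 + r2
    return l1 + r1, r2 + l2                  # inverted: stream 2 is reversed


def straight(left, right):
    return ("[]", left, right)


def inverted(left, right):
    return ("<>", left, right)


S, I = straight, inverted
A, B, C, x, y = "A", "B", "C", "x", "y"

# One list per line of Lemma 4; all members of a list should be equivalent.
CHAINS = [
    [S(A, S(B, C)), S(S(A, B), C)],
    [I(A, I(B, C)), I(I(A, B), C)],
    [S(A, x), I(A, x)],
    [S(x, A), I(x, A)],
    [S(A, y), I(y, A)],
    [S(y, A), I(A, y)],
    [S(x, I(A, B)), I(x, I(A, B)), I(I(x, A), B), I(S(x, A), B)],
    [I(x, S(A, B)), S(x, S(A, B)), S(S(x, A), B), S(I(x, A), B)],
    [S(I(A, B), x), I(I(A, B), x), I(A, I(B, x)), I(A, S(B, x))],
    [I(S(A, B), x), S(S(A, B), x), S(A, S(B, x)), S(A, I(B, x))],
    [S(y, I(A, B)), I(I(A, B), y), I(A, I(B, y)), I(A, S(y, B))],
    [I(y, S(A, B)), S(S(A, B), y), S(A, S(B, y)), S(A, I(y, B))],
    [S(I(A, B), y), I(y, I(A, B)), I(I(y, A), B), I(S(A, y), B)],
    [I(S(A, B), y), S(y, S(A, B)), S(S(y, A), B), S(I(A, y), B)],
]


def lemma_holds():
    return all(yields(t) == yields(chain[0]) for chain in CHAINS for t in chain)
```

A check of this kind covers only the listed instances with atomic A, B, C; the general statement follows because the yields of larger subtrees compose in the same way.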
The method of Figure 8 modifies the input tree to attach singletons as closely
as possible to couples, while remaining consistent with the input tree in the following
sense: singletons cannot "escape" their immediately surrounding brackets. The key is
that for any given subtree, if the outermost bracket involves a singleton that should 
be rotated into a subtree, then exactly one of the singleton rotation properties will 
apply. The method proceeds depth-first, sinking each singleton as deeply as possible. 
Figure 9
Alternative ITG parse trees (a), (b), and (c) for the same matching.
For example, after rebalancing, sentence (7) is bracketed as follows: 
(8) [[[[The/ε Authority/…] [will/… ⟨[be/ε accountable/…] [to
the/ε [ε/… [Financial/… Secretary/…]]]⟩]]] ./。]
Flattening the Bracketing. In the worst case, both sentences might have perfectly 
aligned words, lending no discriminative leverage whatsoever to the bracketer. This 
leaves a very large number of choices: if both sentences are of length l, then there
are $\binom{2l}{l}\frac{1}{l+1}$ possible bracketings with rank 2, none of which is better justified than any
other. Thus to improve accuracy, we should reduce the specificity of the bracketing's 
commitment in such cases. 
An inconvenient problem with ambiguity arises in the simple bracketing grammar 
above, illustrated by Figure 9; there is no justification for preferring either (a) or (b) over 
the other. In general the problem is that both the straight and inverted concatenation 
operations are associative. That is, [A[AA]] and [[AA]A] generate the same two output
strings, which are also generated by [AAA]; and similarly with ⟨A⟨AA⟩⟩ and ⟨⟨AA⟩A⟩,
which can also be generated by ⟨AAA⟩. Thus the parse shown in (c) is preferable to
either (a) or (b) since it does not make an unjustifiable commitment either way. 
Productions in the form of (c), however, are not permitted by the normal form we 
use, in which each bracket can only hold two constituents. Parsing must overcommit, 
since the algorithm is always forced to choose between (A(BC)) and ((AB)C) structures 
even when no choice is clearly better. We could relax the normal form constraint, 
but longer productions clutter the grammar unnecessarily and, in the case of generic 
bracketing grammars, reduce parsing efficiency considerably. 
Instead, we employ a more complicated but better-constrained grammar as shown 
in Figure 10, designed to produce only canonical tail-recursive parses. We differenti- 
ate type A and B constituents, representing subtrees whose roots have straight and 
inverted orientation, respectively. Under this grammar, a series of nested constituents 
with the same orientation will always have a left-heavy derivation. The guarantee 
that parsing will produce a tail-recursive tree facilitates easy identification of those
nesting levels that are associative (and therefore arbitrary), so that those levels can
be "flattened" by a postprocessing stage after parsing into non-normal form trees like
the one in Figure 9(c). The algorithm proceeds bottom-up, eliminating as many brackets
as possible, by making use of the associativity equivalences [[AB]C] = [ABC] and
⟨⟨AB⟩C⟩ = ⟨ABC⟩. The singleton bidirectionality and flipping commutativity
equivalences (see Lemma 4) can also be applied whenever they render the associativity
equivalences applicable.
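The core of the flattening step can be sketched as follows. This is a simplified illustration rather than our implementation: it applies only the associativity equivalences (not the singleton bidirectionality or flipping equivalences), merges same-orientation children on either side, and uses a hypothetical tuple encoding in which a node is (orientation, child, child, ...) and a leaf is a string.

```python
def flatten(tree):
    """Merge nested brackets of the same orientation, bottom-up."""
    if isinstance(tree, str):
        return tree                       # leaf: nothing to flatten
    op, *children = tree
    flat = [op]
    for child in map(flatten, children):  # flatten subtrees first
        if not isinstance(child, str) and child[0] == op:
            flat.extend(child[1:])        # associativity: absorb the child's constituents
        else:
            flat.append(child)            # different orientation: bracket must stay
    return tuple(flat)


nested = ("[]", ("[]", ("[]", "A", "B"), "C"), ("<>", ("<>", "D", "E"), "F"))
flat = flatten(nested)
```

Brackets whose orientation differs from that of their parent carry real ordering information and are therefore preserved.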
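$$
\begin{array}{ll}
A \xrightarrow{\;a\;} [A\ B] & \\
A \xrightarrow{\;a\;} [B\ B] & \\
A \xrightarrow{\;a\;} [C\ B] & \\
A \xrightarrow{\;a\;} [A\ C] & \\
A \xrightarrow{\;a\;} [B\ C] & \\
B \xrightarrow{\;a\;} \langle A\ A \rangle & \\
B \xrightarrow{\;a\;} \langle B\ A \rangle & \\
B \xrightarrow{\;a\;} \langle C\ A \rangle & \\
B \xrightarrow{\;a\;} \langle A\ C \rangle & \\
B \xrightarrow{\;a\;} \langle B\ C \rangle & \\
C \xrightarrow{\;b_{ij}\;} u_i/v_j & \text{for all } i,j \text{ English-Chinese lexical translations} \\
C \xrightarrow{\;b_{i\epsilon}\;} u_i/\epsilon & \text{for all } i \text{ English vocabulary} \\
C \xrightarrow{\;b_{\epsilon j}\;} \epsilon/v_j & \text{for all } j \text{ Chinese vocabulary}
\end{array}
$$

Figure 10
A stochastic constituent-matching ITG.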
The final result after flattening sentence (8) is as follows: 
(9) [The/ε Authority/… will/… ⟨[be/ε accountable/…] [to the/ε ε/… Financial/…
Secretary/…]⟩ ./。]
Experiment. Approximately 2,000 sentence-pairs with both English and Chinese 
lengths of 30 words or less were extracted from our corpus and bracketed using 
the algorithm described. Several additional criteria were used to filter out unsuitable 
sentence-pairs. If the lengths of the pair of sentences differed by more than a 2:1 ratio, 
the pair was rejected; such a difference usually arises as the result of an earlier error 
in automatic sentence alignment. Sentences containing more than one word absent 
from the translation lexicon were also rejected; the bracketing method is not intended 
to be robust against lexicon inadequacies. We also rejected sentence-pairs with fewer 
than two matching words, since this gives the bracketing algorithm no discriminative 
leverage; such pairs accounted for less than 2% of the input data. A random sample of 
the bracketed sentence-pairs was then drawn, and the bracket precision was computed 
under each criterion for correctness. Examples are shown in Figure 11. 
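The selection criteria can be summarized in a short predicate. The sketch below is illustrative only and rests on several assumptions: the lexicon is a hypothetical dictionary from English words to sets of Chinese words, the limit of one out-of-lexicon word is applied to each sentence separately, and a "matching word" is an English word with at least one listed translation present in the Chinese sentence.

```python
def keep_pair(english, chinese, lexicon, max_len=30, max_ratio=2.0):
    """Return True if a tokenized sentence-pair passes all filters."""
    if not english or not chinese:
        return False
    if len(english) > max_len or len(chinese) > max_len:
        return False                                   # too long
    longer = max(len(english), len(chinese))
    shorter = min(len(english), len(chinese))
    if longer > max_ratio * shorter:
        return False                                   # likely sentence misalignment
    known_chinese = set().union(*lexicon.values())
    if sum(w not in lexicon for w in english) > 1:
        return False                                   # too many unknown English words
    if sum(w not in known_chinese for w in chinese) > 1:
        return False                                   # too many unknown Chinese words
    present = set(chinese)
    matches = sum(1 for w in english if lexicon.get(w, set()) & present)
    return matches >= 2                                # need some discriminative leverage


# Toy lexicon over placeholder symbols.
toy = {"a": {"A"}, "b": {"B"}, "c": {"C"}}
```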
The bracket precision was 80% for the English sentences, and 78% for the Chinese 
sentences, as judged against manual bracketings. Inspection showed the errors to be 
due largely to imperfections of our translation lexicon, which contains approximately 
6,500 English words and 5,500 Chinese words with about 86% translation accuracy (Wu 
and Xia 1994), so a better lexicon should yield substantial performance improvement. 
Moreover, if the resources for a good monolingual part-of-speech or grammar-based 
bracketer such as that of Magerman and Marcus (1990) are available, its output can 
readily be incorporated in complementary fashion as discussed in Section 9. 
\[These/~_- a~ arrangements/~J~ will/c c/~-J enhance/\]JIl~ our/~ (\[¢/~ ability/~ll~ 
~\] \[to/e e/Et~ maintain/,.~ monetary/~ stability/l~ in the years to come/el) 
./o \] \[The/¢ Authority/~ will/~ (\[be/e accountable/~\] \[to the/¢ ¢/~ 
Financial/~ Secretary/~\]/ ./o \] 
\[They/~d~ ( are/e right/iE~tf e/nL~ to/e do/~ e/~_~!J~ so/e ) ./o \] 
\[(\[ Even/e more/~l~ important/~l~ \] \[,/¢ however/{EI \]) \[,/c e/B-~, is/~ to make the 
very best of our/e e/~'~=JE~ own/T\];~ e/IY'J talent/,k:q- 1-/o \] 
\[I/~ hope/c e/<>~ employers/~E will/~ make full/e e/~:~ use/~ \[of/e 
those/~\]l~-aZ a\] ((\[~/f~J-r who/X\] \[have acquired/e e/~-~\] new/~J~ skills/~\]l~ \]) 
\[through/~i~ this/L~ programme/~illl\]\]/ ./o 1 
\[I/~J~ have/~, <> at/e length/~-~,~l\]t ( on/e how/,a~,~ we/~ e/-~l~,,~) \[can/~--JJ)~ 
boost/e e/~j~ our/2~i'~ e/~ prosperity/~l~i~\] ./o I 
Figure 11 
Bracketing output examples. (<> = unrecognized input token.) 
8. Alignment 
8.1 Phrasal Alignment 
Phrasal translation examples at the subsentential level are an essential resource for 
many MT and MAT architectures. This requirement is becoming increasingly direct 
for the example-based machine translation paradigm (Nagao 1984), whose translation 
flexibility is strongly restricted if the examples are only at the sentential level. It can 
now be assumed that a parallel bilingual corpus may be aligned to the sentence level 
with reasonable accuracy (Kay and Röscheisen 1988; Catizone, Russell, and Warwick
1989; Gale and Church 1991; Brown, Lai, and Mercer 1991; Chen 1993), even for
languages as disparate as Chinese and English (Wu 1994). Algorithms for subsentential
alignment have been developed as well, at granularities of the character (Church 1993),
word (Dagan, Church, and Gale 1993; Fung and Church 1994; Fung and McKeown 
1994), collocation (Smadja 1992), and specially segmented (Kupiec 1993) levels. How- 
ever, the identification of subsentential, nested, phrasal translations within the parallel 
texts remains a nontrivial problem, due to the added complexity of dealing with con- 
stituent structure. Manual phrasal matching is feasible only for small corpora, either 
for toy-prototype testing or for narrowly restricted applications. 
Automatic approaches to identification of subsentential translation units have 
largely followed what we might call a "parse-parse-match" procedure. Each half of 
the parallel corpus is first parsed individually using a monolingual grammar. Subse- 
quently, the constituents of each sentence-pair are matched according to some heuristic 
procedure. A number of recent proposals can be cast in this framework (Sadler and 
Vendelmans 1990; Kaji, Kida, and Morimoto 1992; Matsumoto, Ishimoto, and Utsuro 
1993; Cranias, Papageorgiou, and Peperidis 1994; Grishman 1994). 
The parse-parse-match procedure is susceptible to three weaknesses: 
1. Appropriate, robust, monolingual grammars may not be available. This
condition is particularly relevant for many non-Western European
languages such as Chinese. A grammar for this purpose must be robust
since it must still identify constituents for the subsequent matching
process even for unanticipated or ill-formed input sentences.
2. The grammars may be incompatible across languages. The best-matching
constituent types between the two languages may not include the same
core arguments. While grammatical differences can make this problem
unavoidable, there is often a degree of arbitrariness in a grammar's
chosen set of syntactic categories, particularly if the grammar is designed
to be robust. The mismatch can be exacerbated when the monolingual
grammars are designed independently, or under different theoretical
considerations.
3. Selection between multiple possible arrangements may be arbitrary. By an
"arrangement" between any given pair of sentences from the parallel
corpus, we mean a set of matchings between the constituents of the
sentences. The problem is that in some cases, a constituent in one
sentence may have several potential matches in the other, and the
matching heuristic may be unable to discriminate between the options.
In the sentence pair of Figure 4, for example, both Security Bureau and
police station are potential lexical matches to the same Chinese word. To
choose the best set of matchings, an optimization over some measure of
overlap between the structural analysis of the two sentences is needed.
Previous approaches to phrasal matching employ arbitrary heuristic
functions on, say, the number of matched subconstituents.
Our method attacks the weaknesses of the parse-parse-match procedure by us- 
ing (1) only a translation lexicon with no language-specific grammar, (2) a bilingual 
rather than monolingual formalism, and (3) a probabilistic formulation for resolving 
the choice between candidate arrangements. The approach differs in its single-stage 
operation that simultaneously chooses the constituents of each sentence and the match- 
ings between them. 
The raw phrasal translations suggested by the parse output were then filtered to 
remove those pairs containing more than 50% singletons, since such pairs are likely to 
be poor translation examples. Examples that occurred more than once in the corpus 
were also filtered out, since repetitive sequences in our corpus tend to be nongram- 
matical markup. This yielded approximately 2,800 filtered phrasal translations, some 
examples of which are shown in Figure 12. A random sample of the phrasal translation 
pairs was then drawn, giving a precision estimate of 81.5%. 
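The two filters can be sketched as follows. The representation is an assumption made for the example: each candidate phrasal translation is a tuple of (English, Chinese) leaf pairs, with None marking the unmatched side of a singleton, and the function name `filter_phrases` is hypothetical.

```python
from collections import Counter


def filter_phrases(candidates, max_singleton_frac=0.5):
    """Drop repeated candidates and those dominated by singletons."""
    counts = Counter(candidates)
    kept = []
    for phrase in counts:                  # distinct phrases, in first-seen order
        if counts[phrase] > 1:
            continue                       # repeated sequences tend to be markup
        singletons = sum(1 for e, c in phrase if e is None or c is None)
        if singletons > max_singleton_frac * len(phrase):
            continue                       # more than 50% singletons
        kept.append(phrase)
    return kept


# Toy candidates over placeholder symbols.
p1 = (("new", "X"), ("skills", "Y"))
p2 = (("the", None), ("of", None), ("policy", "Z"))
p3 = (("a", "A"),)
kept = filter_phrases([p1, p2, p3, p3])
```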
Although this already represents a useful level of accuracy, it does not in our opin- 
ion reflect the full potential of the formalism. Inspection revealed that performance was 
greatly hampered by our noisy translation lexicon, which was automatically learned; 
it could be manually post-edited to reduce errors. Commercial on-line translation
lexicons could also be employed if available. Higher precision could also be achieved
without great effort by engineering a small number of broad nonterminal categories. 
This would reduce errors for known idiosyncratic patterns, at the cost of manual rule 
building. 
The automatically extracted phrasal translation examples are especially useful 
where the phrases in the two languages are not compositionally derivable solely from 
obvious word translations. An example is \[have acquired/¢ ¢/-~\] new/~J~ skills/~ 
~j~\] in Figure 11. The same principle applies to nested structures also, such as (\[ ~/~ 
I who/,~ \] \[ have acquired/¢ ¢/~\] new/~J~ skills/~ \]), on up to the sentence 
level. 
1% in real 1%~\]~ 
Would you ~_~; 
an acceptable starting point for this new policy ~~IJ~~ 
are about 3.5 million pk~-350~ 
born in Hong ~ ~ ~ 
for Hong ~ 
have the right to decide our ~J~m~J~ 
in what way the Government would increase ~(J~{~t~}Jll~~@;~ 
their job opportunities ; and 
last month _L~ J~ 
never to say " never " ~-~"~" 
reserves and surpluses ~\]~\[I~, 
starting point for this new policy ~_~~ 
there will be many practical difficulties in terms \]~@~-~I~,~t 
of implementation 
year ended 3 1 March 1 9 9 1 ~_~Ph~J~ --n u- 
Figure 12 
Examples of extracted phrasal translations. 
8.2 Word Alignment 
Under the ITG model, word alignment becomes simply the special case of phrasal 
alignment at the parse tree leaves. This gives us an interesting alternative perspective, 
from the standpoint of algorithms that match the words between parallel sentences. By 
themselves, word alignments are of little use, but they provide potential anchor points 
for other applications, or for subsequent learning stages to acquire more interesting 
structures. 
Word alignment is difficult because correct matchings are not usually linearly 
ordered, i.e., there are crossings. Without some additional constraints, any word po- 
sition in the source sentence can be matched to any position in the target sentence, 
an assumption that leads to high error rates. More sophisticated word alignment al- 
gorithms therefore attempt to model the intuition that proximate constituents in close 
relationships in one language remain proximate in the other. The later IBM models are 
formulated to prefer collocations (Brown et al. 1993). In the case of word_align (Dagan, 
Church, and Gale 1993; Dagan and Church 1994), a penalty is imposed according to 
the deviation from an ideal matching, as constructed by linear interpolation.⁴
From this point of view, the proposed technique is a word alignment method that 
imposes a more realistic distortion penalty. The tree structure reflects the assumption 
that crossings should not be penalized as long as they are consistent with constituent 
structure. Figure 7 gives theoretical upper bounds on the matching flexibility as the 
lengths of the sequences increase, where the constituent structure constraints are re- 
flected by high flexibility up to length-4 sequences and a rapid drop-off thereafter. In 
other words, ITGs appeal to a language universals hypothesis, that the core arguments 
of frames, which exhibit great ordering variation between languages, are relatively 
few and surface in syntactic proximity. Of course, this assumption over-simplistically
blends syntactic and semantic notions. That semantic frames for different languages
share common core arguments is more plausible than that syntactic frames do. In
effect we are relying on the tendency of syntactic arguments to correlate closely with
semantics. If in particular cases this assumption does not hold, however, the damage
is not too great--the model will simply drop the offending word matchings (dropping
as few as possible).

4 Direct comparison with word_align should be avoided, however, since it is intended to work on corpora
whose sentences are not aligned.
In experiments with the minimal bracketing transduction grammar, the large ma- 
jority of errors in word alignment were caused by two outside factors. First, word 
matchings can be overlooked simply due to deficiencies in our translation lexicon. This 
accounted for approximately 42% of the errors. Second, sentences containing nonliteral 
translations obviously cannot be aligned down to the word level. This accounted for 
approximately another 50% of the errors. Excluding these two types of errors, accuracy
on word alignment was 96.3%. In other words, the tree structure constraint is strong 
enough to prevent most false matches, but almost never inhibits correct word matches 
when they exist. 
9. Bilingual Constraint Transfer 
9.1 Monolingual Parse Trees 
A parse may be available for one of the languages, especially for well-studied lan- 
guages such as English. Since this eliminates all degrees of freedom in the English 
sentence structure, the parse of the Chinese sentence must conform with that given 
for the English. Knowledge of English bracketing is thus used to help parse the Chi- 
nese sentence; this method facilitates a kind of transfer of grammatical expertise in 
one language toward bootstrapping grammar acquisition in another. 
A parsing algorithm for this case can be implemented very efficiently. Note that 
the English parse tree already determines the split point S for breaking e_0..T into two
constituent subtrees deriving e_0..S and e_S..T respectively, as well as the nonterminal
labels j and k for each subtree. The same then applies recursively to each subtree.
We indicate this by turning S, j, and k into deterministic functions on the English
constituents, writing S_st, j_st, and k_st to denote the split point and the subtree labels for
any constituent e_s..t. The following simplifications can then be made to the parsing
algorithm:
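2. Recursion
For all English constituents e_s..t and all i, u, v such that 1 ≤ i ≤ N and 0 ≤ u ≤ v ≤ V:

$$
\delta^{[\,]}_{stuv}(i) = \max_{u \le U \le v} a_{i \to [j_{st} k_{st}]}\ \delta_{s,S_{st},u,U}(j_{st})\ \delta_{S_{st},t,U,v}(k_{st}) \tag{19}
$$

$$
\upsilon^{[\,]}_{stuv}(i) = \operatorname*{argmax}_{u \le U \le v}\ \delta_{s,S_{st},u,U}(j_{st})\ \delta_{S_{st},t,U,v}(k_{st}) \tag{20}
$$

$$
\delta^{\langle\rangle}_{stuv}(i) = \max_{u \le U \le v} a_{i \to \langle j_{st} k_{st} \rangle}\ \delta_{s,S_{st},U,v}(j_{st})\ \delta_{S_{st},t,u,U}(k_{st}) \tag{21}
$$

$$
\upsilon^{\langle\rangle}_{stuv}(i) = \operatorname*{argmax}_{u \le U \le v}\ \delta_{s,S_{st},U,v}(j_{st})\ \delta_{S_{st},t,u,U}(k_{st}) \tag{22}
$$

3. Reconstruction

$$
\mathrm{LEFT}(q) =
\begin{cases}
\bigl(s,\ S_{st},\ u,\ \upsilon_q^{[\,]}(\ell(q))\bigr) & \text{if } \theta_q(\ell(q)) = [\,] \\
\bigl(s,\ S_{st},\ \upsilon_q^{\langle\rangle}(\ell(q)),\ v\bigr) & \text{if } \theta_q(\ell(q)) = \langle\rangle
\end{cases}
\tag{23}
$$

$$
\mathrm{RIGHT}(q) =
\begin{cases}
\bigl(S_{st},\ t,\ \upsilon_q^{[\,]}(\ell(q)),\ v\bigr) & \text{if } \theta_q(\ell(q)) = [\,] \\
\bigl(S_{st},\ t,\ u,\ \upsilon_q^{\langle\rangle}(\ell(q))\bigr) & \text{if } \theta_q(\ell(q)) = \langle\rangle
\end{cases}
\tag{24}
$$

$$
\ell(\mathrm{LEFT}(q)) = j_{st} \tag{25}
$$

$$
\ell(\mathrm{RIGHT}(q)) = k_{st} \tag{26}
$$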
The time complexity for this constrained version of the algorithm drops from O(N^3 T^3 V^3)
to O(T V^3).
9.2 Partial Parse Trees 
A more realistic in-between scenario occurs when partial parse information is available 
for one or both of the languages. Special cases of particular interest include applications 
where bracketing or word alignment constraints may be derived from external sources 
beforehand. For example, a broad-coverage English bracketer may be available. If such 
constraints are reliable, it would be wasteful to ignore them. 
A straightforward extension to the original algorithm inhibits hypotheses that 
are inconsistent with given constraints. Any entries in the dynamic programming ta- 
ble corresponding to illegal subhypotheses--i.e., those that would violate the given 
bracket-nesting or word alignment conditions--are preassigned negative infinity val- 
ues during initialization indicating impossibility. During the recursion phase, computa- 
tion of these entries is skipped. Since their probabilities remain impossible throughout, 
the illegal subhypotheses will never participate in any ML bibracketing. The running 
time reduction in this case depends heavily on the domain constraints. 
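As an illustration of the bracket-nesting case, the sketch below marks the English spans that would cross an externally supplied bracket; chart entries over such spans are the ones that would be preassigned an impossible score. The half-open (start, end) span encoding and the function names are assumptions made for the example.

```python
def crosses(span, bracket):
    """True if two half-open spans overlap without one containing the other."""
    (s, t), (a, b) = span, bracket
    return s < a < t < b or a < s < b < t


def blocked_spans(length, brackets):
    """Return all spans over a sentence of the given length that violate a bracket."""
    return {
        (s, t)
        for s in range(length + 1)
        for t in range(s, length + 1)
        if any(crosses((s, t), br) for br in brackets)
    }


# A four-word sentence with one known bracket around words 1..2.
blocked = blocked_spans(4, [(1, 3)])
```

Every chart entry (s, t, u, v) whose English span lies in the blocked set can be skipped during the recursion; the same test applies symmetrically to brackets known for the other language.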
We have found this strategy to be useful for incorporating punctuation constraints. 
Certain punctuation characters give constituency indications with high reliability; "per- 
fect separators" include colons and Chinese full stops, while "perfect delimiters" in- 
clude parentheses and quotation marks. 
10. Unrestricted-Form Grammars 
It is possible to construct a parser that accepts unrestricted-form, rather than normal- 
form, grammars. In this case an Earley-style scheme (Earley 1970), employing an active 
chart, can be used. The time complexity remains the same as the normal-form case. 
We have found this to be useful in practice. For bracketing grammars of the type 
considered in this paper, there is no advantage. However, for more complex, linguisti- 
cally structured grammars, the more flexible parser does not require the unreasonable 
numbers of productions that can easily arise from normal-form requirements. For most 
grammars, we have found performance to be comparable to or faster than that of the
normal-form parser.
11. Conclusion 
The twin concepts of bilingual language modeling and bilingual parsing have been 
proposed. We have introduced a new formalism, the inversion transduction grammar, 
and surveyed a variety of its applications to extracting linguistic information from 
parallel corpora. Its amenability to stochastic formulation, useful flexibility with leaky 
and minimal grammars, and tractability for practical applications are desirable proper- 
ties. Various tasks such as segmentation, word alignment, and bracket annotation are 
naturally incorporated as subproblems, and a high degree of compatibility with con- 
ventional monolingual methods is retained. In conjunction with automatic procedures 
for learning word translation lexicons, SITGs bring relatively underexploited bilingual 
correlations to bear on the task of extracting linguistic information for languages less 
studied than English. 
We are currently pursuing several directions. We are developing an iterative train- 
ing method based on expectation-maximization for estimating the probabilities from 
parallel training corpora. Also, in contrast to the applications discussed here, which 
deal with analysis and annotation of parallel corpora, we are working on incorporating 
the SITG model directly into our run-time translation architecture. The initial results 
indicate excellent performance gains. 
Acknowledgments 
I would like to thank Xuanyin Xia, Eva 
Wai-Man Fong, Pascale Fung, and Derick 
Wood, as well as an anonymous reviewer 
whose comments were of great value. 
References 
Aho, Alfred V. and Jeffrey D. Ullman. 1969a. 
Properties of syntax directed translations. 
Journal of Computer and System Sciences, 
3(3):319-334. 
Aho, Alfred V. and Jeffrey D. Ullman. 
1969b. Syntax directed translations and 
the pushdown assembler. Journal of 
Computer and System Sciences, 3(1):37-56. 
Aho, Alfred V. and Jeffrey D. Ullman. 1972. 
The Theory of Parsing, Translation, and 
Compiling. Prentice Hall, Englewood 
Cliffs, NJ. 
Black, Ezra, Roger Garside, and Geoffrey 
Leech, editors. 1993. Statistically-Driven 
Computer Grammars of English: The 
IBM/Lancaster Approach. Editions Rodopi,
Amsterdam. 
Brown, Peter F., John Cocke, Stephen 
A. DellaPietra, Vincent J. DellaPietra, 
Frederick Jelinek, John D. Lafferty, Robert 
L. Mercer, and Paul S. Roossin. 1990. A 
statistical approach to machine 
translation. Computational Linguistics, 
16(2):79-85.
Brown, Peter F., Stephen A. DellaPietra, 
Vincent J. DellaPietra, and Robert 
L. Mercer. 1993. The mathematics of 
statistical machine translation: Parameter 
estimation. Computational Linguistics, 
19(2):263-311. 
Brown, Peter F., Jennifer C. Lai, and Robert 
L. Mercer. 1991. Aligning sentences in 
parallel corpora. In Proceedings of the 29th 
Annual Meeting, pages 169-176, Berkeley, 
CA. Association for Computational 
Linguistics. 
Catizone, Roberta, Graham Russell, and 
Susan Warwick. 1989. Deriving 
translation data from bilingual texts. In 
Proceedings of the First Lexical Acquisition 
Workshop, Detroit, MI. 
Chang, Chao-Huang and Cheng-Der Chen. 
1993. HMM-based part-of-speech tagging
for Chinese corpora. In Proceedings of the 
Workshop on Very Large Corpora, pages 
40-47, Columbus, OH, June. 
Chen, Stanley F. 1993. Aligning sentences in 
bilingual corpora using lexical 
information. In Proceedings of the 31st 
Annual Meeting, pages 9-16, Columbus, 
OH. Association for Computational 
Linguistics. 
Chiang, Tung-Hui, Jing-Shin Chang, 
Ming-Yu Lin, and Keh-Yih Su. 1992. 
Statistical models for word segmentation 
and unknown word resolution. In Proceedings of
ROCLING-92, pages 121-146. 
Church, Kenneth W. 1993. Char-align: A 
program for aligning parallel texts at the 
character level. In Proceedings of the 31st 
Annual Meeting, pages 1-8, Columbus, 
OH. Association for Computational 
Linguistics. 
Cranias, Lambros, Harris Papageorgiou, and 
Stelios Piperidis. 1994. A matching
technique in example-based machine 
translation. In Proceedings of the Fifteenth
International Conference on Computational 
Linguistics, pages 100-104, Kyoto. 
Dagan, Ido and Kenneth W. Church. 1994. 
Termight: Identifying and translating 
technical terminology. In Proceedings of the 
Fourth Conference on Applied Natural 
Language Processing, pages 34-40, 
Stuttgart, October. 
Dagan, Ido, Kenneth W. Church, and 
William A. Gale. 1993. Robust bilingual 
word alignment for machine aided 
translation. In Proceedings of the Workshop 
on Very Large Corpora, pages 1-8, 
Columbus, OH, June. 
Earley, Jay. 1970. An efficient context-free 
parsing algorithm. Communications of the 
Association for Computing Machinery, 
13(2):94-102. 
Fung, Pascale and Kenneth W. Church. 
1994. K-vec: A new approach for aligning 
parallel texts. In Proceedings of the Fifteenth 
International Conference on Computational
Linguistics, pages 1096-1102, Kyoto. 
Fung, Pascale and Kathleen McKeown. 
1994. Aligning noisy parallel corpora 
across language groups: Word pair 
feature matching by dynamic time 
warping. In AMTA-94, Association for 
Machine Translation in the Americas, pages 
81-88, Columbia, MD, October. 
Fung, Pascale and Dekai Wu. 1994. 
Statistical augmentation of a Chinese 
machine-readable dictionary. In 
Proceedings of the Second Annual Workshop 
on Very Large Corpora, pages 69-85, Kyoto, 
August. 
Gale, William A. and Kenneth W. Church. 
1991. A program for aligning sentences in 
bilingual corpora. In Proceedings of the 29th 
Annual Meeting, pages 177-184, Berkeley, 
CA. Association for Computational 
Linguistics. 
Gale, William A., Kenneth W. Church, and 
David Yarowsky. 1992. Using bilingual 
materials to develop word sense 
disambiguation methods. In TMI-92, 
Proceedings of the Fourth International 
Conference on Theoretical and Methodological 
Issues in Machine Translation, pages 
101-112, Montreal. 
Gazdar, Gerald and Christopher S. Mellish. 
1989. Natural Language Processing in LISP: 
An Introduction to Computational Linguistics. 
Addison-Wesley, Reading, MA. 
Grishman, Ralph. 1994. Iterative alignment 
of syntactic structures for a bilingual 
corpus. In Proceedings of the Second Annual 
Workshop on Very Large Corpora, pages 
57-68, Kyoto, August.
Kaji, Hiroyuki, Yuuko Kida, and Yasutsugu 
Morimoto. 1992. Learning translation 
templates from bilingual text. In 
Proceedings of the Fourteenth International 
Conference on Computational Linguistics, 
pages 672-678, Nantes. 
Kaplan, Ronald M. and Martin Kay. 1994. 
Regular models of phonological rule 
systems. Computational Linguistics, 
20(3):331-378. 
Kasami, T. 1965. An efficient recognition 
and syntax analysis algorithm for 
context-free languages. Technical Report 
AFCRL-65-758, Air Force Cambridge 
Research Laboratory, Bedford, MA. 
Kay, Martin and M. Röscheisen. 1988.
Text-translation alignment. Technical 
Report P90-00143, Xerox Palo Alto 
Research Center. 
Koskenniemi, Kimmo. 1983. Two-level 
morphology: A general computational 
model for word-form recognition and 
production. Technical Report 11, 
Department of General Linguistics, 
University of Helsinki. 
Kupiec, Julian. 1993. An algorithm for 
finding noun phrase correspondences in 
bilingual corpora. In Proceedings of the 31st 
Annual Meeting, pages 17-22, Columbus, 
OH. Association for Computational 
Linguistics. 
Laporte, Eric. 1996. Context-free parsing 
with finite-state transducers. In String 
Processing Colloquium, Recife, Brazil. 
Lewis, P. M. and R. E. Stearns. 1968. 
Syntax-directed transduction. Journal of the 
Association for Computing Machinery, 
15:465-488. 
Lin, Yi-Chung, Tung-Hui Chiang, and 
Keh-Yih Su. 1992. Discrimination oriented
probabilistic tagging. In Proceedings of 
ROCLING-92, pages 85-96. 
Lin, Ming-Yu, Tung-Hui Chiang, and 
Keh-Yih Su. 1993. A preliminary study on 
unknown word problem in Chinese word
segmentation. In Proceedings of 
ROCLING-93, pages 119-141. 
Magerman, David M. and Mitchell 
P. Marcus. 1990. Parsing a natural 
language using mutual information 
statistics. In Proceedings of AAAI-90, Eighth 
National Conference on Artificial Intelligence, 
pages 984-989. 
Matsumoto, Yuji, Hiroyuki Ishimoto, and 
Takehito Utsuro. 1993. Structural 
matching of parallel texts. In Proceedings of 
the 31st Annual Meeting, pages 23-30, 
Columbus, OH. Association for 
Computational Linguistics. 
Nagao, Makoto. 1984. A framework of a 
mechanical translation between Japanese 
and English by analogy principle. In 
Alick Elithorn and Ranan Banerji, editors, 
Artificial and Human Intelligence: Edited
Review Papers Presented at the International 
NATO Symposium on Artificial and Human 
Intelligence. North-Holland, Amsterdam, 
pages 173-180. 
Pereira, Fernando. 1991. Finite-state 
approximation of phrase structure 
grammars. In Proceedings of the 29th Annual 
Meeting, Berkeley, CA. Association for 
Computational Linguistics. 
Pereira, Fernando and Yves Schabes. 1992. 
Inside-outside reestimation from partially 
bracketed corpora. In Proceedings of the 
30th Annual Meeting, pages 128-135, 
Newark, DE. Association for 
Computational Linguistics. 
Roche, Emmanuel. 1994. Two parsing 
algorithms by means of finite-state 
transducers. In Proceedings of the Fifteenth 
International Conference on Computational 
Linguistics, Kyoto. 
Sadler, Victor and Ronald Vendelmans. 
1990. Pilot implementation of a bilingual 
knowledge bank. In Proceedings of the 
Thirteenth International Conference on 
Computational Linguistics, pages 449-451, 
Helsinki. 
Savitch, Walter J. 1982. Abstract Machines and 
Grammars. Little, Brown, Boston, MA. 
Smadja, Frank A. 1992. How to compile a 
bilingual collocational lexicon 
automatically. In AAAI-92 Workshop on 
Statistically-Based NLP Techniques, pages 
65-71, San Jose, CA, July. 
Sproat, Richard, Chilin Shih, William Gale, 
and Nancy Chang. 1994. A stochastic 
word segmentation algorithm for a 
Mandarin text-to-speech system. In 
Proceedings of the 32nd Annual Meeting, 
pages 66-72, Las Cruces, NM, June. 
Association for Computational 
Linguistics. 
Viterbi, Andrew J. 1967. Error bounds for 
convolutional codes and an 
asymptotically optimal decoding 
algorithm. IEEE Transactions on Information 
Theory, 13:260-269. 
Wu, Dekai. 1994. Aligning a parallel 
English-Chinese corpus statistically with 
lexical criteria. In Proceedings of the 32nd 
Annual Meeting, pages 80-87, Las Cruces, 
NM, June. Association for Computational 
Linguistics. 
Wu, Dekai. 1995. An algorithm for 
simultaneously bracketing parallel texts 
by aligning words. In Proceedings of the 
33rd Annual Meeting, pages 244-251, 
Cambridge, MA, June. Association for 
Computational Linguistics. 
Wu, Dekai and Pascale Fung. 1994. 
Improving Chinese tokenization with 
linguistic filters on statistical lexical 
acquisition. In Proceedings of the Fourth 
Conference on Applied Natural Language 
Processing, pages 180-181, Stuttgart, 
October. 
Wu, Dekai and Xuanyin Xia. 1994. Learning 
an English-Chinese lexicon from a 
parallel corpus. In AMTA-94, Association 
for Machine Translation in the Americas, 
pages 206-213, Columbia, MD, October. 
Wu, Zimin and Gwyneth Tseng. 1993. 
Chinese text segmentation for text 
retrieval: Achievements and problems. 
Journal of the American Society for
Information Science, 44(9):532-542.
Younger, David H. 1967. Recognition and 
parsing of context-free languages in time 
n^3. Information and Control, 10(2):189-208.

