An Algorithm for Simultaneously Bracketing Parallel Texts 
by Aligning Words 
Dekai Wu 
HKUST 
Department of Computer Science 
University of Science & Technology 
Clear Water Bay, Hong Kong 
dekai@cs, ust. hk 
Abstract 
We describe a grammarless method for simul- 
taneously bracketing both halves of a paral- 
lel text and giving word alignments, assum- 
ing only a translation lexicon for the language 
pair. We introduce inversion-invariant trans- 
duction grammars which serve as generative 
models for parallel bilingual sentences with 
weak order constraints. Focusing on Wans- 
duction grammars for bracketing, we formu- 
late a normal form, and a stochastic version 
amenable to a maximum-likelihood bracketing 
algorithm. Several extensions and experiments 
are discussed. 
1 Introduction 
Parallel corpora have been shown to provide an extremely 
rich source of constraints for statistical analysis (e.g., 
Brown et al. 1990; Gale & Church 1991; Gale et al. 1992; 
Church 1993; Brown et al. 1993; Dagan et al. 1993; 
Dagan & Church 1994; Fung & Church 1994; Wu & 
Xia 1994; Fung & McKeown 1994). Our thesis in this 
paper is that the lexical information actually gives suffi- 
cient information to extract not merely word alignments, 
but also bracketing constraints for both parallel texts. 
Aside from purely linguistic interest, bracket structure 
has been empirically shown to be highly effective at con- 
straining subsequent training of, for example, stochas- 
tic context-free grammars (Pereira & ~ 1992; 
Black et al. 1993). Previous algorithms for automatic 
bracketing operate on monolingual texts and hence re- 
quire more grammatical constraints; for example, tac- 
tics employing mutual information have been applied to 
tagged text (Magerumn & Marcus 1990). 
Algorithms for word alignment attempt to find the 
matching words between parallel sentences. 1 Although 
word alignments are of little use by themselves, they 
provide potential anchor points for other applications, 
or for subsequent learning stages to acquire more inter- 
esting structures. Our technique views word alignment 
1 Wordmatching is a more accurate term than word alignment 
since the matchings may cross, but we follow the literature. 
and bracket annotation for both parallel texts as an inte- 
grated problem. Although the examples and experiments 
herein are on Chinese and English, we believe the model 
is equally applicable to other language pairs, especially 
those within the same family (say Indo-European). 
Our bracketing method is based on a new formalism 
called an inversion.invariant transduction grammar. By 
their nature inversion-invariant transduction grammars 
overgenerate, because they permit too much constituent- 
ordering freedom. Nonetheless, they turn out to be very 
useful for recognition when the true grammar is not fully 
known. Their purpose is not to flag ungrammatical in- 
pots; instead they assume that the inputs are grammatical, 
the aim being to extract structure from the input data, in 
kindred spirit with robust parsing. 
2 Inversion-Invariant Transduction 
Grammars 
A Wansduction grammar is a bilingual model that gen- 
erates two output streams, one for each language. The 
usual view of transducers as having one input stream and 
one output stream is more appropriate for restricted or 
deterministic finite-state machines. Although finite-state 
transducers have been well studied, they are insufficiently 
powerful for bilingual models. The models we consider 
here are non-deterministic models where the two lan- 
guages' role is symmetric. 
We begin by generalizing transduction to context-free 
form. In a context-free transduction grammar, terminal 
symbols come in pairs that~ are emitted to separate output 
streams. It follows that each rewrite rule emits not one 
but two streams, and that every non-terminal stands for 
a class of derivable substring pairs. For example, in the 
rewrite rule 
A ~ B x/y C z/e 
the terminal symbols z and z are symbols of the language 
Lx and are emitted on stream 1, while the terminal symbol 
y is a symbol of the language L2 and is emitted on stream 
2. This rule implies that z/y must be a valid entry in 
the translation lexicon. A matched terminal symbol pair 
such as z/y is called a couple. As a spe,Aal case, the 
null symbol e in either language means that no output 
244 
S 
PP 
NP 
NN 
VP 
W 
Pro 
Det 
Class 
Prep 
N 
V 
NP VP 
Prep NP 
Pro I Det Class NN 
ModN \[ NNPP 
VV \[ VV NN I VP PP 
V \] Adv V 
I/~ I you/f$ 
~-* for/~ 
~. book/n 
Figure 1: Example IITG. 
token is generated. We call a symbol pair such as x/e an 
Ll-singleton, and ely an L2-singleton. 
We can employ context-free transduction grammars in 
simple attempts at generative models for bilingual sen- 
tence pairs. For example, pretend for the moment that 
the simple ttansduetion grammar shown in Figure 1 is a 
context-free transduction grammar, ignoring the ~ sym- 
bols that are in place of the usual ~ symbols. This gram- 
mar generates the following example pair of English and 
Chinese sentences in translation: 
(1) a. \[I \[\[took \[a book\]so \]vp \[for yon\]~ \]vp \]s 
b. \[~i \[\[~T \[--*W\]so \]w \[~\]~ \]vt, \]s 
Each instance of a non-terminal here actually derives 
two subsltings, one in each of the sentences; these two 
substrings are translation counterparts. This suggests 
writing the parse trees together: 
(2) ~ \[\[took/~Y \[a/~ d~: book/1\[\]so \]vp \[for/~\[~ 
you/~\]pp \]vv \]s 
The problem with context-free transduction granunars 
is that, just as with finite-state transducers, both sentences 
in a translation pair must share exactly the same gram- 
matic~d structure (except for optional words that can be 
handled with lexical singletons). For example, the fol- 
lowing sentence pair with a perfectly valid, alternative 
Chinese translation cannot be generated: 
(3) a. \[I \[\[took \[a book\]so \]vp \[for you\]v~ \]vP \]s 
b. \[~ \[\[~¢~\]~ \[~T \[--~\]so \]vt, \]vP \]s 
We introduce the device of an inversion-invafiant trans- 
duction grammar (IITG) to get around the inflexibility of 
context-free txansduction grammars. Productions are in- 
terpreted as rewrite rules just as with context-free trans- 
duction grammars, with one additional proviso: when 
generating output for stream 2, the constituents on a 
rule's right-hand side may be emitted either left-to-right 
(as usual) or right-to-left (in inverted order). We use 
instead of --~ to indicate this. Note that inversion is 
permitted at any level of rule expansion. 
With this simple proviso, the transduction grammar of 
Figure 1 straightforwardly generates sentence-pair (3). 
However, the IITG's weakened ordering constraints now 
also permit the following sentence pairs, where some 
constituents have been reversed: 
(4) & *\[I \[\[for youlpp \[\[a bookl~p tooklvp \]vp \]s 
b. \[~ \[\[~¢~\]1~ \[~tT \[--:*:It\]so \]w \]vp \]s 
(5) a. *\[\[\[yon for\]re \[\[a book\]so took\]w \]vp I\]s 
b. *\[~ \[\[~\]rp \[\[tl\[:~--\]so ~T\]vP \]VP \]S 
As a bilingual generative linguistic theory, therefore, 
IITGs are not well-motivated (at least for most natural 
language pairs), since the majority of constructs do not 
have freely revexsable constituents. 
We refer to the direction of a production's L2 con- 
stituent ordering as an orientation. It is sometimes useful 
to explicitly designate one of the two possible orienta- 
tions when writing productions. We do this by dis- 
tinguishing two varieties of concatenation operators on 
string-pairs, depending on tim odeatation. Tim operator 
\[\] performs the "usual" paitwise concatenation so that 
\[ A B\] yields the string-pair ( Cx , C2 ) where Cx = A1Bx 
and (52 = A2B2. But the operator 0 concatema~ con- 
stituents on output stream 1 while reversing them on 
stream 2, so that Ci = AxBx but C2 = B2A2. For 
example, the NP .-. Det Class NN rule in the transduc- 
tion grammar above actually expands to two standard 
rewrite rules: 
-. \[Bet NN\] 
(DetClass NN) 
Before turning to bracketing, we take note of three 
lemmas for IITGs (proofs omitted): 
Lemma l For any inversion-invariant transduction 
grammar G, there exists an equivalent inversion- 
invariant transduction grammar G' where T(G) = 
T( G'), such that: 
1. lfe E LI(G) and e E L2(G), then G' contains a 
single production of the form S' --~ e / c, where S' is 
the start symbol of G' and does not appear on the 
right-hand side of any production of G' ; 
2. otherwise G' contains no productions of the form 
A ~ e/e. 
Lemma2 For any inversion-invariant transduction 
grammar G, there exists an equivalent inversion- 
invariant transduction gratrm~r G' where T(G) = 
T(G'), T(G) = T(G'), such that the right-hand side 
of any production of G' contains either a single terminal- 
pair or a list of nonterminals. 
Lemma3 For any inversion-invariant transduction 
grammar G, there exists an equivalent inversion trans- 
duction grammar G' where T( G) = T( G'), such that G' 
does not contain any productions of the form A --, B. 
3 Bracketing Transduction Grammars 
For the remainder of this paper, we focus our attention 
on pure bracketing. We confine ourselves to bracketing 
245 
transduction grammars (BTGs), which are IITGs where 
constituent categories ate not differentiated. Aside from 
the start symbol S, BTGs contain only one non-terminal 
symbol, A, which rewrites either recursively as a string 
of A's or as a single terminal-pair. In the former case, the 
productions has the form A ~-, A ! where we use A ! to ab- 
breviate A... A, where thefanout f denotes the number 
of A's. Each A corresponds to a level of bracketing and 
can be thought of as demarcating some unspecified kind 
of syntactic category. (This same "repetitive expansion" 
restriction used with standard context-free grammars and 
transduetion grammars yields bracketing grammars with- 
out orientation invariauce.) 
A full bracketing transduction grammar of degree f 
contains A productions of every fanout between 2 and 
f, thus allowing constituents of any length up to f. In 
principle, a full BTG of high degree is preferable, hav- 
ing the greatest flexibility to acx~mmdate arbitrarily long 
matching sequences. However, the following theorem 
simplifies our algorithms by allowing us to get away with 
degree-2 BTGs. I ~t~ we will see how postprocessing 
restores the fanout flexibility (Section 5.2). 
Theorem 1 For any full bracketing transduction gram- 
mar T, there exists an equivalent bracketing transduction 
grammar T' in normal form where every production takes 
one of the followingforms: 
S ~ e/e 
S ~ A 
A ~ AA 
A ~ z/y 
A ~ ~:/e 
A ~ ely 
Proof By Lemmas 1, 2, and 3, we may assume T 
contains only productions of the form S ~-* e/e, A 
z/y, A ~ z/e, A ~-* e/y, and A ,--* AA... A. For proof 
by induction, we need only show that any full BTG T of 
degree f > 2 is equivalent to a full BTG T' of degree 
f- 1. It suffices to show that the production A ~-, A ! call 
be removed without any loss to the generated language, 
i.e., tha! the remaining productions in T' can still derive 
any string-pair derivable by T (removing a production 
cannot increase the set of derivable string-pairs). Let 
(E, C) be any siring-pair derivable from A ~ A 1, where 
E is output on stream 1 and C on stream 2. Define 
E i as the substring of E derived from the ith A of the 
production, and similarly define C i. There are two cases 
depending on the concatenation orientation, but (E, C) 
is derivable by T' in either case. 
In the first case, if the derivation used was A ..-, \[A!\], 
thenE = E 1 ...E l andC = C1...C 1. Let(E',C') = 
(E 1 ... E !-x, C1... C1-1). Then (E', C') is derivable 
from A --~ \[A!-I\], and thus (E, C) = (E~E 1, C~C !) 
is derivable from A ~ \[A A\]: In the second case, the 
derivation used was A ---. {A !), and we still have E = 
E 1 ... E ! but now C -- CY... C 1. Now let (E', C") = 
A ~ accountable/~tJ\[ 
A ,---+ anthority/~t~ 
A ~ finauciaYl\[#l~ 
A .-* secretary/~ 
A ~ to/~ 
A ~-, wfll\]~ 
A ~ Jo 
A ,-, beJe 
A ~ thele 
Figure 2: Some relevant lexical productions. 
.. E 1-1 , C 1-1 ... C1). ~ (E', C") is derivable 
(~A --* (A!-I), and thus (E, e) - (E'E !, C!C ") 
is derivable from A ---, (A A). \[7 
4 Stochastic Bracketing Transduction 
Grammars 
In a stochastic BTG (SBTG), each rewrite rule has a prob- 
ability. Let a! denote the probability of the A-production 
with fanout degree f. For the remaining (lexical) pro- 
dnctions, we use b(z, y) to denote P\[A ~ z/vlA\]. The 
probabiliti~ obey the constraint that 
Ea! + Eb(z'Y)= 1 
l ~¢,Y 
For our experiments we employed a normal form trans- 
duction grammar, so a! = 0 for all f # 2. The A- 
productions used were: 
A ~-* AA 
A b(&~) z/v 
A b~O x/e 
A ~%~) e/V 
for all z, y lexical translations 
for all z English vocabulary 
for all y Chinese vocabulary 
The b(z, y) distribution actually encodes the English- 
Chinese translation lexicon. As discussed below, the 
lexicon we employed was automatically learned from a 
parallel corpus, giving us the b(z, y) probabilities di- 
rectly. The latter two singleton forms permit any word 
in either sentence to be unmatched. A small e-constant 
is chosen for the probabilities b(z, e) and b(e, y), so that 
the optimal bracketing resorts to these productions only 
when it is otherwise impossible to match words. 
With BTGs, to parse means to build matched bracket- 
ings for senmnce-pairs rather than sentences. Tiffs means 
that the adjacency constraints given by the nested levels 
must be obeyed in the bracketings of both languages. The 
result of the parse gives bracketings for both input sen- 
tences, as well as a bracket alignment indicating the cor- 
responding brackets between the sentences. The bracket 
alignment includes a word alignment as a byproduct. 
Consider the following sentence pair from our corpus: 
246 
Jo 
will/~\[#~ 
The/c Authority/~t~ 
belt accountabl~ theJ~ 
Financh~tt~ 
Figure 3: Bracketing tree. 
Secretary/--~ 
(6) a. The Authority will be accountable to the Finan- 
cial Secretary. 
b. Ift~l~t'~l~t~t~o 
Assume we have the productions in Figure 2, which is 
a fragment excerpted from our actual BTG. Ignoring cap- 
italization, an example of a valid parse that is consistent 
with our linguistic ideas is: 
(7) \[\[\[ The/e Authority/~t~ \] \[ will/~ (\[ be& 
accountable/~t~ \] \[ to/~ \[ the/¢ \[\[ Financial/~l~ 
Secretary/~ \]\]\]\])\]\] J. \] 
Figure 3 shows a graphic representation of the same 
brac&eting, where the 0 level of lrac, keting is marked 
by the horizontal line. The English is read in the usual 
depth-first left-to-right order, but for the Chinese, a hori- 
zontal line means the right subtree is traversed before the 
left. 
The () notation concisely displays the common struc- 
ture of the two sentences. However, the bracketing is 
clearer if we view the sentences monolingually, which 
allows us to invert the Chinese constituents within the 0 
so that only \[\] brackets need to appear.. 
(8) a. \[\[\[ The Authority \] \[ will \[\[ be accountable \] \[ to 
\[ the \[\[ Financial Secretary \]\]\]\]\]\]1. \] 
k \[\[\[\[ "~,'~ \] \[ ~t' \[\[ I~ \[\[ ~ ~\] \]\]\]\] \[ ~.l 
\]\]\]\] o \] 
In the monolingual view, extra brackets appear in one lan- 
guage whenever there is a singleton in the other language. 
If the goal is just to obtain ~ for monolingual sen- 
tences, the extra brackets can be discarded aft~ parsing: 
(9) \[\[\[ ~,~ \] \[ ~R \[ ~ \[ Igil~ ~ \]\] \[ ~ttt \]\]\] o \] 
The basis of the bracketing strategy can be seen as 
choosing the bracketing that maximizes the (probabilis- 
tically weighted) number of words matched, subject to 
the BTG representational constraint, which has the ef- 
fect of limiting the possible crossing patterns in the word 
alignment. A simpler, related idea of penalizing dis- 
tortion from some ideal matching pattern can be found 
in the statistical translation (Brown et al. 1990; Brown 
et al. 1993) and word alignment (Dagan et al. 1993; 
Dagan & Church 1994) models. Unlike these mod- 
els, however, the BTG aims m model constituent struc- 
ture when determining distortion penalties. In particu- 
lar, crossings that are consistent with the constituent tree 
structure are not penalized. The implicit assumption is 
that core arguments of frames remain similar across lan- 
guages, and tha! core arguments of the same frame will 
surface adjacently. The accuracy of the method on a 
particular language pair will therefore depend upon the 
extent to which this language universals hypothesis holds. 
However, the approach is robust because if the assump- 
tion is violated, damage will be limited to dropping the 
fewest possible crossed word matchings. 
We now describe how a dynzmic-programming parser 
can compute an optimal bxackcting given a sentence-pair 
and a stochastic BTG. In bilingual parsing, just as with or- 
dinary monolingual parsing, probabilizing the grammar 
247 
permits ambiguities to be resolved by choosing the max- 
imum likelihood parse. Our algorithm is similar in spirit 
to the recognition algorithm for HMMs (Viterbi 1967). 
Denote the input English sentence by el, • •., er and 
the corresponding input Chinese sentence by el,..., cv. 
As an abbreviation we write co.., for the sequence of 
words eo+l,e,+2,... ,e~, and similarly for c~..~. Let 
6.tu~ = maxP\[e,..t/e~..~\] be the maximum probability 
of any derivation from A that__ successfully parses both 
substrings es..t and ¢u..v. The best parse of the sentence 
pair is that with probability 60,T,0y. 
The algorithm computes 6o,T,0,V following the recur- 
fences below. 2 The time complexity of this algorithm 
is O(TaV a) where T and V are the lengths of the two 
sen~. 
1. Initialization 
6t--l,t,v--l,v "- 
2. Recursion 
6ttu v "-- 
Ottu u --" 
where 
l<t<T 
b(e,/~ ), 1 < v < V 
maxr/~\[\] 60 1 t stuv~ stuvJ 
.,6\[ \] 611 s~ stuv ~ stuv 
6\[\]uv = max a2 6,suu 6stuv s<S<~ 
u<V<v 
a\[l stuv "- axg s max 6sSut.r 6$tUv 
s<S<t u<U<v 
v \[\] -- arg U max 6,suu6stuv sgut~ s<S<t 
u<U<v 
6J~uv -- max a 2 6sSU~ 6StuU s<$<t 
u<U<v 
*r!~uv = arg s max 6,SV~ 6Stuff s<S<t 
u<U<v 
V~uv = arg U max 6,su~ 6S,uV 
s<S<t u<V<v 
3. Reconstrm:tion Using 4-tuples to name each node 
of the parse tree, initially set qx = (0, T, 0, V) to be the 
root. The remaining descendants in the optimal parse tree 
are then given recursively for any q = (s, t, u, v) by: 
LEFT' " "s ~r\[\] u v \[\] ~ / ~q) = ( ' \[~ '"~' '\[\] ''"~) f if0,t~ = \[\] 
mGHT(q) = t, 
LEFr' " "s o "0 v 0 v" 
RIGHT(q) = (a!~uv,t,u,v~u~) ) ifO, tuv = 0 
Several additional extensions on this algorithm were 
found to be useful, and are briefly described below. De- 
tails are given in Wu (1995). 
2We are gene~!izing argmax as to allow arg to specify the 
index of interest. 
4.1 Simultaneous segmentation 
We often find the same concept realized using different 
numbers of words in the two languages, creating potential 
difficulties for word alignment; what is a single word in 
English may be realized as a compound in Chinese. Since 
Chinese text is not orthographically separated into words, 
the standard methodology is to first preproce~ input texts 
through a segmentation module (Chiang et al. 1992; 
Linet al. 1992; Chang & Chert 1993; Linet al. 1993; 
Wu & Tseng 1993; Sproat et al. 1994). However, this se- 
rionsly degrades our algorithm's performance, since the 
the segmenter may encounter ambiguities that are un- 
resolvable monolingually and thereby introduce errors. 
Even if the Chinese segmentation is acceptable moaolin- 
gually, it may not agree with the division of words present 
in the English sentence. Moreover, conventional com- 
pounds are frequently and unlmxlictably missing from 
translation lexicons, and this can furllu~ degrade perfor- 
Inane. 
To avoid such problems we have extended the algo- 
rithm to optimize the segmentation of the Chinese sen- 
tence in parallel with the ~ting lm~:ess. Note that 
this treatment of segmentation does not attempt to ad- 
dress the open linguistic question of what constitutes a 
Chinese "word". Our definition of a correct "segmenta- 
tion" is purely task-driven: longer segments are desirable 
if and only ff no compositional translation is possible. 
4.2 Pre/post-positional biases 
Many of the bracketing errors are caused by singletons. 
With singletons, there is no cross-lingual discrimination 
to increase the certainty between alternative brackeaings. 
A heuristic to deal with this is to specify for each of the 
two languages whether prepositions or postpositions 
more common, where "preposition" here is meant not 
in the usual part-of-speech sense, but rather in a broad 
sense of the tendency of function words to attach left 
or right. This simple swategcm is effective because the 
majority of unmatched singletons are function words that 
counterparts in the other language. This observation 
holds assuming that the translation lexicon's coverage 
is reasonably good. For both English and Chinese, we 
specify a prepositional bias, which means that singletons 
are attached to the right whenever possible. 
4.3 Punctuation constraints 
Certain punctuation characters give strong constituency 
indications with high reliability. "Perfect separators", 
which include colons and Chinese full stops, and "pet- 
feet delimiters", which include parentheses and quota- 
tion marks, can be used as bracketing constraints. We 
have extended the algorithm to precluded hypotheses 
that are inconsistent with such constraints, by initializ- 
ing those entries in the DP table corresponding to illegal 
sub-hypotheses with zero probabilities, These entries are 
blocked from recomputation during the DP phase. As 
their probabilities always remain zero, the illegal brack- 
etings can never participate in any optimal bracketing. 
248 
5 Postprocessing 
5.1 A Singleton-Rebalancing Algorithm 
We now introduce an algorithm for further improving the 
bracketing accuracy in cases of singletons. Consider the 
following bracketing produced by the algorithm of the 
previous section: 
(10) \[tThe/~ \[\[Authority/~f~ \[wilg~ad (\[be/~ 
accountable/~t~\] \[to the/~ \[~/~ \[Financial/~i~ 
Seaetary/-nl \]\]\])\]ll\] Jo \] 
The prepositional bias has already correctly restricted the 
singleton "Tbe/d' to attach to the right, but of course 
"The" does not belong outside the rest of the sentence, 
but rather with "Authority". The problem is that single- 
tons have no discriminative power between alternative 
bracket matchings--they only contribute to the ambigu- 
ity. However, we can minimize the impact by moving 
singletons as deep as possible, closer to the individual 
word they precede or succeed, by widening the scope 
of the brackets immediately following the singleton. In 
general this improves precision since wide-scope brack- 
ets are less constraining. 
The algorithm employs a rebalancing strategy rem- 
niscent of balanced-tree structures using left and right 
rotations. A left rotation changes a (A(BC)) structure to 
a ((AB)C) structure, and vice versa for a right rotation. 
The task is complicated by the presence of both \[\] and 
0 brackets with both LI- and L2-singletons, since each 
combination presents different interactions. To be legal, 
a rotation must preserve symbol order on both output 
streams. However, the following lemma shows that any 
subtree can always be rebalanced at its root if either of its 
children is a singleton of either language. 
Lenuna 4 Let x be a L1 singleton, y be a L2 singleton, 
and A, B, C be arbitrary constituent subtrees. Then the 
following properties hold for the \[\] and 0 operators: 
(Associativity) 
\[A\[BC\]\] = \[\[AB\]C\] 
(A(BC)) = ((AB)C) 
(L, -singleton bidirectionality) 
lax\] ~-- (A~) 
\[,A\] : (xA) 
(L2-singleton flipping commutativity) 
\[Av\] = (vA) 
\[uA\] = (Av) 
(L 1-singleton rotation properties) 
\[z(AB)\] ~- (x(AB)) ~-- ((zA)B) ~- (\[xA\]B) 
(x\[aB\]) ~--- \[x\[AB\]\] ~--- \[\[zA\]B\] .~ \[(xA)B\] 
\[(AB)x\] = ((AB)~) = (A(B~)) = (A\[B~\]) 
(lAB\]x) ~- \[\[AB\]x\] = \[A\[Bx\]\] ~--- \[A(Bx)\] 
(L~-singleton rotation properties) 
\[v(AB)\] = ((AB)v) = (A(Bv)) = (AtvB\]) 
(y\[AB\]) ~-- \[\[AB\]y\] ~ \[A\[By\]\] ~-- \[A(yB)\] 
\[(AB)v\] ,~ (y(AB)) ~ ((vA)B) ~- (My\]B) 
(\[AB\]v) ~ \[v\[AB\]\] = ttvA\]B\] = \[(Av)B\] 
The method of Figure 4 modifies the input tree to attach 
singletons as closely as possible to couples, but remain- 
ing consistent with the input tree in the following sense: 
singletons cannot "escape" their inmmdiately surround- 
ing brackets. The key is that for any given subtree, if 
the outermost bracket involves a singleton that should 
be rotated into a subtree, then exactly one of the single- 
ton rotation properties will apply. The method proceeds 
depth-first, sinking each singleton as deeply as possible. 
For example, after rebalm~cing, sentence (10) is bracketed 
as follows: 
(11) \[\[\[\[The/e Authority/~\] \[witV~1t' (\[be/e 
accountable/~tft\] \[to the/~ \[dFBJ \[Fhumciai/ll~'i~ 
Secretary/--~ 111)111 Jo \] 
5.2 Flattening the Bracketing 
Because the BTG is in normal form, each bracket can 
only hold two constituents. This improves parsing ef- 
ficiency, but requires overcommiUnent since the algo- 
rithm is always forced to choose between (A(BC)) and 
((AB)C) statures even when no choice is clearly bet- 
ter. In the worst case, both senteau:~ might have perfectly 
aligned words, lending no discriminative leverage what- 
soever to the bfac~ter. This leaves a very large number 
of choices: if both sentences are of length i = m, then 
thel~ ~ (21) 1 possible lracJw~ngs with fanout 2, 
none of which is better justitied than any other. Thus to 
improve accuracy, we should reduce the specificity of the 
bracketing's commitment in such cases. 
We implement this with another postprocessing stage. 
The algorithm proceeds bottom-up, elimiDming as malay 
brackets as possible, by making use of the associafiv- 
ity equivalences \[ABel = \[A\[BC\]\] = \[lAB\]C\] and 
SINK-SINGLETON(node) 
1 ffnode is not aleaf 
2 if a rotation property applies at node 
3 apply the rotation to node 
4 ch//d ~-- the child into which the singleton 
5 was rotated 
6 SINK-SINGLETON(chi/d) 
RE~AL~CE-aXEE(node) 
1 if node is not a leaf 
2 REBALANCE-TREE(left-child\[node\]) 
3 REeALANCE-TREE(right-child\[node\]) 
4 S ~K-SXNGI.,E'ro~(node) 
Figure 4: The singleton rebalancing schema. 
249 
\[These/~ arrangements/~ will/e ef~ enhance/~q~ our/~ (\[d~J ability/~;0\] \[tok dEt ~ maintain/~t~ 
monetary/~t stability/~ in the years to come/e\]) do \] 
\[The/e Authority/~\]~ will/~ (\[be/e accountable/gt~\] \[to the/e elm Financial/l~i~ Secretary/~\]) Jo \] 
\[They/~t!l~J ( are/e right/iE~ d-l-Jff tok do/~ e/~ so/e ) io \] 
\[(\[ Evenk more~ important/l~ \] \[Je however/~_ \]) \[Je e/~, is/~ to make the very best of our/e e/~ffl~ own/~ 
$~ e/~J talent/X~ \] J. \] 
hope/e e/o!~l employers/{l\[~l~ will/~ make full/e dg~rj'~ use/~ \[offe those/\]Jl~a~__\] ((\[dJfJ-V who/&\] \[have 
aequired/e e/$~ new/~i skills/tS~l~ \]) \[through/L~i~t thisJ~l programme/~l'|~\]) J. \] 
have/~ o at/e length/~l ( on/e how/~g~ we/~ e/~ll~) \[canFaJJ)~ boostk d~ilt our/~:~ e/~ prosperity/$~  \]Jo\] 
Figure 5: Bracketing/alignment output examples. (~ = unrecognized input token.) 
(ABC) = (A(BC)) = ((AB)C). Tim singletonbidi- 
rectionality and flipping eommutativity equivalences (see 
Lemma 4) are also applied, whenever they render the as- 
sociativity equivalences applicable. 
The final result after flattening sentence (11) is as fol- 
lows: 
(12) \[ The/e Authority/~\]~ will/g~' (\[ be/e 
accountable/J~tJ!\[ \] \[ to tl~/e elm Financial/l~ Secretary/--~ 1) 
j o \] 
6 Experiments 
Evaluation methodology for bracketing is controversial 
because of varying perspectives on what the "gold stan- 
dard" should be. We identify two prototypical positions, 
and give results for both. One position uses a linguistic 
evaluation criterion, where accuracy is measured against 
some theoretic notion of constituent structure. The other 
position uses a functional evaluation criterion, where the 
"correctness" of a bracketing depends on its utility with 
respect to the application task at hand. For example, here 
we consider a bracket-pair functionally useful if it cor- 
rectly identifies phrasal translations---especially where 
the phrases in the two languages are not compositionally 
derivable solely from obvious word translations. Notice 
that in contrast, the linguistic evaluation criterion is in- 
sensitive to whether the bracketings of the two sentences 
match each other in any semantic way, as long as the 
monolingual bracketings in each sentence are correct. In 
either case, the bracket precision gives the proportion 
of found br~&ets that agree with the chosen correctness 
criterion. 
All experiments reported in this paper were performed 
on sentence-pairs from the HKUST English-Chinese Par- 
allel Bilingual Corpus, which consists of governmental 
transcripts (Wu 1994). The translation lexicon was au- 
tomatically learned from the same corpus via statisti- 
cal sentence alignment (Wu 1994) and statistical Chi- 
nese word and collocation extraction (Fung & Wu 1994; 
Wu & Fung 1994), followed by an EM word-translation 
learning procedure (Wu & Xia 1994). The translation 
lexicon contains an English vocabulary of approximately 
6,500 words and a Chinese vocabulary of approximately 
5,500 words. The mapping is many-to-many, with an 
average of 2.25 Chinese translations per English word. 
The translation accuracy is imperfect (about 86% percent 
weighted precision), which turns out to cause many of 
the bracketing errors. 
Approximately 2,000 sentence-pairs with both English 
and Chinese lengths of 30 words or less were extracted 
from our corpus and bracketed using the algorithm de- 
scribed. Several additional criteria were used to filter 
out unsuitable sentence-pairs. If the lengths of the pair 
of sentences differed by more thml a 2:1 ratio, the pair 
was rejected; such a difference usually arises as the re- 
sult of an earlier error in automatic sentence alignment. 
Sentences containing more than one word absent from 
the translation lexicon were also rejected; the bracketing 
method is not intended to be robust against lexicon inade- 
quacies. We also rejected sentence pairs with fewer than 
two matching words, since this gives the bracketing al- 
gorithm no diso'iminative leverage; such pairs ~c~ounted 
for less than 2% of the input data. A random sample 
of the b~keted sentence pairs was then drawn, and the 
bracket precision was computed under each criterion for 
correctness. Additional examples are shown in Figure 5. 
Under the linguistic criterion, the monolingual bracket 
precision was 80.4% for the English sentences, and 78.4% 
for the Chinese sentences. Of course, monolinguai 
grammar-based bracketing methods can achieve higher 
precision, but such tools assume grammar resources that 
may not be available, such as good Chinese granuna~. 
Moreover, if a good monolingual bracketer is available, 
its output can easily be incorporated in much the same 
way as punctn~ion constraints, thereby combining the 
best of both worlds. Under the functional criterion, the 
parallel bracket precision was 72.5%, lower than the 
monolingual precision since brackets can be correct in 
one language but not the other. Grammar-based bracket- 
ing methods cannot directly produce results of a compa- 
rable nature. 
250 
7 Conclusion 
We have proposed a new tool for the corpus linguist's 
arsenal: a method for simultaneously bracketing both 
halves of a parallel bilingual corpus, using only a word 
translation lexicon. The method can also be seen as a 
word alignment algorithm that employs a realistic dis- 
tortion model and aligns consituents as well as words. 
The basis of the approach is a new inversion-invariant 
transduction grammar formalism. 
Various extension strategies for simultaneous segmen- 
tation, positional biases, punctuation constraints, single- 
ton rebalancing, and bracket flattening have been intro- 
duced. Parallel bracketing exploits a relatively untapped 
source of constraints, in that parallel bilingual sentences 
are used to mutually analyze each other. The model 
nonetheless retains a high degree of compatibility with 
more conventional monolingual formalisms and methods. 
The bracketing and alignment of parallel corpora can 
be fully automatized with zero initial knowledge re- 
sources, with the aid of automatic procedures for learning 
word translation lexicons. This is particularly valuable 
for work on languages for which online knowledge re- 
sources are relatively scarce compared with English. 
Acknowledgement 
I would like to thank Xuanyin Xia, Eva Wai-Man Foug, 
Pascale Fung, and Derick Wood. 

References 
BLACK, EZRA, ROGER GARSIDE, & GEoF~EY I~ (eds.). 
1993. Statistically-driven computer grammars of En- 
glish: The IB~aster approach. Amsterdam: Edi- 
tions Rodopi. 
BROWN, Pt~reR F., JOHN COCKE, STEPHEN A. D~1APt~rgA, 
VINCENT J. ~t~rttA, FR~ERICK J~LnqWK, JOHN D. 
~, ROBERT L. MERCER, & PAUL S. RoossiN. 
1990. A statistical approach to machine translation. Com- 
putational Linguistics, 16(2):29-85. 
BROWN, PETER E, STEPHEN A. DIKLAPmTxA, VINCENT J. DEL- 
LAPteTgA, & ROBERT L. M~CER. 1993. The mathematics 
of statistical machine translation: Parameter estimation. 
Computational Linguistics, 19(2):263-311. 
CHANG, CHAO-HUANG & CHE~G-DER CHEN. 1993. HMM- 
based part-of-speech tagging for Chinese corpora. In Pro- 
ceedings of the Workshop on Very Large Corpora, 40-47, 
Columbus, Ohio. 
CHIANG, TUNG-HUI, JING-SHIN CHANG, MING-YU LIN, & KEH- 
YIH Su. 1992. Statistical models for word segmentation 
and unknown resolution. In Proceedings of ROCLING-92, 
121-146. 
CHURCH, ~ W. 1993. Char-align: A program for align- 
ing parallel texts at the character level. In Proceedings of 
the 31st Annual Conference of the Association for Com- 
putational Linguistics, 1-8, Columbus, OH. 
DAGAN, IDO & KENNETH W. CHURCH. 1994. Termight: Iden- 
tifying and translating technical terminology. In Proceed- 
ings of the Fourth Conference on Applied Natural Lan- 
guage Processing, 34-40, Stuttgart. 
DAGAN, IDO, KENNETH W. CHURCH, & W\[\]\[JJ~ A. GAL~. 
1993. Robust bilingual word alignment for machine aided 
translation. In Proceedings of the Wor~hop on Very Large 
Corpora, 1-8, Columbus, OH. 
FUNO, PASCALE & KENNETH W. CHURCH. 1994. K-vec: A new 
approach for aligning parallel texts. In Proceedings of 
the Fifteenth International Conference on Computational 
Linguistics, 1096-1102, Kyoto. 
FUNG, PASCALE & KATI~J~ McKEoWN. 1994. Aligning 
noisy parallel corpora across language groups: Word pair 
feature matching by dynamic time warping. In AMTA- 
94, Association for Machine Translation in the Americas, 
81-88, Columbia, Maryland. 
FUNO, PASCALE & DEKAI Wu. 1994. Statistical augmentation 
of a Chinese machine-readable dictionary. In Proceedings 
of the Second Annual Workshop on Very Large Corpora, 
69-85, Kyoto. 
GALE, WnH~M A. & ~ W. CHURCH. 1991. Aprogram 
for aligning sentences in bilingual corpora. In Proceed- 
ings of the 29th Annual Conference of the Association for 
Computational Linguistics, 177-184, Berkeley. 
GALE, WnHAM A., KENNETH W. CHURCH, & DAVID 
YAROWSKY. 1992. Using bilingual materials to develop 
word sense disambiguatlon methods. In Fourth Inter- 
national Conference on Theoretical and Methodological 
Issues in Machine Translation, 101-112, Montreal. 
I~, M~o-Yu, Tt~o-Hta ~o, & K~-Ym Su. 1993. 
A preliminary study on unknown word problem in Chi- 
nese word segmentation. In Proceedings ofROCLING-93, 
119-141. 
LIN, YI-CHUNG, TUNG-HUI CHIANG, & KEH-Ym SU. 1992. 
Discrimination oriented pmbabilistic tagging. In Proceed- 
ings of ROCLING-92, 85-96. 
MAGERMAN, DAVID M. & ~ L p. MARCUS. 1990. Parsing 
a natural language using mutual information statistics. In 
Proceedings of AAAI-90, Eighth National Conference on 
Artificial Intelligence, 984--989. 
PEREIRA, FEXNANDO & YVES SCHABES. 1992. Inside-outside 
re, estimation from partially bracketed corpora. In Proceed- 
ings of the 30th Annual Conference of the Association for 
Computational Linguistic:, 128-135, Newark, DE. 
SPROAT, RICHARD, CHn JN SHItl, Wn I JAM GALE, & N. CHANG. 
1994. A stochastic word segmentation algorithm for a 
Mandarin text-to-speech system. In Proceedings of the 
32nd Annual Conference of the Association for Computa- 
tional Linguistics, Lag Cruces, New Mexico. To appear. 
VITERBI, ANDREW J. 1967. Error bounds for convolutional 
codes and an asymptotically optimal decoding algorithm. 
IEEE Transactions on Information Theory, 13:260-269. 
WU, DEKAL 1994. Aligning a parallel English-Chinese corpus 
statistically with lexical criteria. In Proceedings of the 
32ndAnnual Conference of the Association for Computa- 
tional Linguistics, 80-87, \[,as Cruces, New Mexico. 
WU, DEKAI, 1995. Stochastic inversion transduction grammars 
and bilingual parsing of parallel corpora. In preparation. 
WU, DEKAI & PASCALE FUNG. 1994. Improving Chinese tok- 
enization with linguistic filters on statistical lexical acqui- 
sition. In Proceedings of the Fourth Conference on@plied 
Natural Language Processing, 180-181, Stuttgart. 
Wu, D~,AI & XUANTIN XIA. 1994. Learning an English- 
Chinese lexicon from a parallel corpus. In AMTA-94, As- 
sociation for Machine Translation in the Americas, 206- 
213, Columbia, Maryland. 
Wu, ZIMIN & GWYI~TH TSI~G. 1993. Chinese text seg- 
mentation for text retrieval: Achievements and problems. 
Journal of The American Society for Information Science, 
44(9):532-542. 
