Learning Chinese Bracketing Knowledge Based on  
a Bilingual Language Model 
Yajuan Lü, Sheng Li, Tiejun Zhao, Muyun Yang  
School of Computer Science & Engineering, Harbin Institute of Technology 
Harbin, China, 150001 
Email: {lyj,lish,tjzhao,ymy}@mtlab.hit.edu.cn 
 
Abstract  
This paper proposes a new method for 
automatic acquisition of Chinese bracketing 
knowledge from English-Chinese sentence- 
aligned bilingual corpora. Bilingual sentence 
pairs are first aligned in syntactic structure by 
combining English parse trees with a 
statistical bilingual language model. Chinese 
bracketing knowledge is then extracted 
automatically. The preliminary experiments 
show automatically learned knowledge 
accords well with manually annotated 
brackets. The proposed method is 
particularly useful to acquire bracketing 
knowledge for a less studied language that 
lacks tools and resources found in a second 
language more studied. Although this paper 
discusses experiments with Chinese and 
English, the method is also applicable to 
other language pairs. 
Introduction 
The past few years have seen great success in the
automatic acquisition of monolingual parsing
knowledge and grammars. The availability of
large tagged and syntactically bracketed corpora,
such as the Penn Treebank, makes it possible to
extract syntactic structures and grammar rules
automatically (Marcus 1993). Substantial
improvements have been made in parsing Western
languages such as English, and many powerful
models have been proposed (Brill 1993, Collins
1997). However, very limited progress has been
achieved for Chinese.
      Knowledge acquisition is a bottleneck for
real applications of Chinese parsing. While some
methods have been proposed to learn syntactic
knowledge from annotated Chinese corpora, most
of these methods depend on annotated or
partially annotated data (Zhou 1997, Streiter 2000).
Due to the limited availability of annotated
Chinese corpora, tests of these methods are still
small in scale. Although some institutions and
universities are currently engaged in building
Chinese treebanks, no large-scale annotated
corpus has been published so far because of the
complexity of Chinese syntactic structure and the
difficulty of corpus annotation (Chen 1996).
This paper proposes a novel method to 
facilitate the Chinese tree bank construction. 
Based on English-Chinese bilingual corpora and 
better English parsing, this method obtains 
Chinese bracketing information automatically via 
a bilingual model and word alignment results. 
The main idea of the method is that we may 
acquire knowledge for a language lacking a rich 
collection of resources and tools from a second 
language that is full of them.  
The rest of this paper is organized as
follows: In the next section, a bilingual language
model is introduced. Then, a bilingual parsing
method supervised by English parsing is
proposed in section 2. Based on the bilingual
parsing, Chinese bracketing knowledge is
extracted in section 3. The evaluation and
discussion are given in section 4. We conclude
with a discussion of future work.
1 A bilingual language model – ITG 
Wu (1997) has proposed a bilingual language 
model called Inversion Transduction Grammar 
(ITG), which can be used to parse bilingual 
sentence pairs simultaneously. We will give a 
brief description here. For details please refer to 
(Wu 1995, Wu 1997).  
The Inversion Transduction Grammar is a
bilingual context-free grammar that generates
two matched output languages (referred to as $L_1$
and $L_2$). It also differs from standard context-free
grammars in that the ITG allows right-hand side
production in two directions: straight or inverted.
The following examples are two ITG 
productions: 
C -> [A B] 
C -> <A B> 
Each nonterminal symbol stands for a pair of
matched strings. For example, the nonterminal A
stands for the string-pair $(A_1, A_2)$. $A_1$ is a
sub-string in $L_1$, and $A_2$ is $A_1$'s corresponding
translation in $L_2$. Similarly, $(B_1, B_2)$ denotes the
string-pair generated by B. The operator [ ]
performs the usual concatenation, so that C -> [A
B] yields the string-pair $(C_1, C_2)$, where $C_1 = A_1 B_1$
and $C_2 = A_2 B_2$. On the other hand, the operator <>
performs the straight concatenation for language
1 but the reversing concatenation for language 2,
so that C -> <A B> yields $C_1 = A_1 B_1$, but $C_2 = B_2 A_2$.
The inverted concatenation operator permits the
extra flexibility needed to accommodate many
kinds of word-order variation between source
and target languages (Wu 1995).
There are also lexical productions of the 
following form in ITG: 
A -> x/y 
This means that a symbol x in language $L_1$ is
translated by the symbol y in language $L_2$. x or y
may be a null symbol e, which means there may
be no counterpart string on the other side of the
bitext.
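The effect of the two operators on the generated string-pair can be illustrated with a minimal sketch. The class names `Leaf` and `Node`, the function `yield_pair` and the tokens x1, y1, ... are illustrative assumptions and do not come from Wu (1995, 1997); an empty string stands in for the null symbol e.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    """Lexical production A -> x/y; an empty string plays the role of the null symbol e."""
    l1: str
    l2: str


@dataclass
class Node:
    """Binary production: C -> [A B] if inverted is False, C -> <A B> if True."""
    left: "Tree"
    right: "Tree"
    inverted: bool = False


Tree = Union[Leaf, Node]


def yield_pair(tree: Tree) -> Tuple[List[str], List[str]]:
    """Return the (L1, L2) word sequences generated by an ITG derivation."""
    if isinstance(tree, Leaf):
        return ([tree.l1] if tree.l1 else [], [tree.l2] if tree.l2 else [])
    a1, a2 = yield_pair(tree.left)
    b1, b2 = yield_pair(tree.right)
    # Language 1 is always concatenated straight (C1 = A1 B1);
    # language 2 is reversed under an inverted production (C2 = B2 A2).
    return a1 + b1, (b2 + a2) if tree.inverted else (a2 + b2)


# <x1/y1 [x2/y2 x3/e]>: the outer node is inverted, the inner one is straight.
demo = Node(Leaf("x1", "y1"), Node(Leaf("x2", "y2"), Leaf("x3", "")), inverted=True)
l1_words, l2_words = yield_pair(demo)
```

Here the language-1 side keeps the order x1 x2 x3, while the language-2 side comes out as y2 y1 because the outer production is inverted and x3 has no counterpart.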
ITG based parsing matches constituents for 
an input sentence-pair. For example, Figure 1 
shows an ITG parsing tree for an 
English-Chinese sentence-pair. The inverted 
production is indicated by a horizontal line in the 
parsing tree. The English text is read in the usual 
depth-first left to right order, but for the Chinese 
text, a horizontal line means the right sub-tree is 
traversed before the left. The generated parsing 
results are: 
(1) [[[Mr. Wu]_BNP [[plays basketball]_VP [on Sunday]_PP]_VP]_S .]_S
(2) [[[g2568 g1820g10995] [g7155g7411g3837 [g6183 g12738g10711]]] g452] 
We can also represent the common structure 
of the two sentences more clearly and compactly 
with the aid of <> notation: 
(3) [[<Mr./g1820g10995 Wu/g2568>_BNP <[plays/g6183 basketball/g12738g10711]_VP [on/e Sunday/g7155g7411g3837]_PP>_VP]_S ./g452]_S
where the horizontal line from Figure 1 
corresponds to the <> level of bracketing. 
Figure 1  Inversion Transduction Grammar parsing (parse tree of the sentence pair in (3); node labels BNP, VP, PP, VP, S, S over the leaves Mr./g1820g10995, Wu/g2568, plays/g6183, basketball/g12738g10711, on/e, Sunday/g7155g7411g3837 and ./g452)
      Any ITG can be converted to a normal form
in which all productions are either lexical
productions or binary-fanout nonterminal
productions (Wu 1997). If a probability is
associated with each production, the ITG is
called a Stochastic Inversion Transduction
Grammar (SITG).
2 English parsing supervised bilingual 
bracketing 
Because of the difficulty in finding a suitable 
bilingual syntactic grammar for Chinese and 
English, a practical ITG is the generic Bracketing 
Inversion Transduction Grammar (BTG)(Wu 
1995). BTG is a simplified ITG that has only one 
nonterminal and does not use any syntactic 
grammar. A Statistical BTG (SBTG) grammar is 
as follows: 
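\[
A \xrightarrow{a} [A\,A]; \quad
A \xrightarrow{a} \langle A\,A \rangle; \quad
A \xrightarrow{b_{ij}} u_i/v_j; \quad
A \xrightarrow{b_{ie}} u_i/e; \quad
A \xrightarrow{b_{ej}} e/v_j
\]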
       SBTG employs only one nonterminal
symbol A that can be used recursively. Here, “a”
denotes the probability of syntactic rules.
However, since constituent categories are
not differentiated in BTG, it has no practical
effect here and can be set to an arbitrary constant.
The remaining productions are all lexical. $b_{ij}$ is
the translation probability that source word $u_i$
translates into target word $v_j$. $b_{ij}$ can be obtained
using a statistical word-translation model
(Melamed 2000) or word alignment (Lü 2001a).
The last two productions denote that the word in
one language has no counterpart on the other side of
the bitext. A small constant can be chosen for the
probabilities $b_{ie}$ and $b_{ej}$.
In BTG, no language specific syntactic 
grammar is used. The maximum-likelihood 
parser selects the parse tree that best satisfies the 
combined lexical translation preferences, as 
expressed by the $b_{ij}$ probabilities. Because the
expressiveness characteristics of ITG naturally 
constrain the space of possible matching in a 
highly appropriate fashion, BTG achieves 
encouraging results for bilingual bracketing 
using a word-translation lexicon alone (Wu 
1997). 
Since no syntactic knowledge is used in
SBTG, output grammaticality cannot be
guaranteed. In particular, if the corresponding
constituents appear in the same order in both 
languages, both straight and inverted, then lexical 
matching does not provide the discriminative 
leverage needed to identify the sub-constituent 
boundaries. For example, consider an 
English-Chinese sentence pair: 
(4) English: That old teacher is our adviser. 
Chinese: g18039g1022g13781g6957g5084g7171g6117g1216g11352g20050g19394g452 
Using SBTG, the bilingual bracketing result is:
(5) [[[[[[That/g18039g1022 old/g13781] teacher/g6957g5084] is/g7171] our/g6117g1216g11352] 
adviser/g20050g19394] ./g452] 
The result is not consistent with the 
expected syntactic structure. In this case, 
grammatical information about one or both of the 
languages can be very helpful. For example, if we 
know the English parsing result shown in (6), 
then the bilingual bracketing can be determined 
easily; the result should be (7).  
(6) [[That old teacher]_BNP [is [our adviser]_BNP]_VP .]_S
(7) [[That/g18039g1022 old/g13781 teacher/g6957g5084] [is/g7171 [our/g6117g1216g11352 
adviser/g20050g19394] ] ./g452] 
From the example, we can see that if one 
language parser is available, the induced 
bilingual bracketing result would be more 
accurate. English parsing methods have been 
well studied and many powerful models have 
been proposed. It will be helpful to make use of 
English parsing results. In the following, we will 
propose a method of bilingual bracketing 
supervised by English parsing.  
Here, English parsing supervised BTG 
means using an English parser’s bracketing 
information as a boundary restriction in the BTG 
language model. But this does not necessitate 
parsing Chinese completely according to the 
same parsing boundary of English. If the English 
parsing structure is totally fixed, it is possible that 
the structure is not linguistically valid for 
Chinese under the formalism of Inversion 
Transduction Grammar. To illustrate this, see the 
example shown in Figure 2.  
If you want to lose weight, you had better eat less bread . 
g3926g7536 g1332 g5831 g1955g17743 g1319g18337g712g7380g3921 g4581 g2519 g19766g2265 g452 
 
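Figure 2  An example of a mismatched subtree. (a) The English subtree [eat [less bread]_BNP]_VP and the corresponding Chinese words. (b) The bilingual subtree with the word pairs eat/g2519, less/g4581 and bread/g19766g2265, where an X node under VP marks the part that is inconsistent with the English subtree. (c) The flattened VP covering “eat less bread” and its Chinese counterpart.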
        The sub-trees for the highlighted part of the
English sentence (“eat less bread”) and the
corresponding Chinese are shown in Figure 2(a).
We can see that the Chinese constituents do not
match their English counterparts in the English
structure. In this case, our solution is as follows:
the whole English “VP” constituent is aligned
with the whole corresponding Chinese string;
i.e., “eat less bread” is matched
with “g4581g2519g19766g2265” shown in Figure 2(b). At the 
same time, we give the inner structure matching 
according to ITG regardless of the English 
parsing constraint. An “X” tag is introduced to 
indicate that the sub-bilingual-parsing-tree is not 
consistent with the given English sub-tree. Our 
result can also be understood as a flattened 
bilingual parsing tree as shown in Figure 2(c). 
This means that when the bilingual constituents
cannot be matched within a small syntactic structure,
we match them within a larger one.
        The main idea is that the given English 
parser is only used as a boundary constraint for 
bilingual parsing. When the constraint is 
incompatible with the bilingual model ITG, we 
use ITG as the default result. This process 
enables parsing to go on regardless of some 
failures in matching. 
We heuristically define a constraint function
$F_e(s, t)$ to denote the English boundary constraint,
where s is the beginning position and t is the end
position of an English span.
There are three cases of structure matching:
violate match, exact match and inside match.
Violate match means the bilingual parsing
conflicts with the given English bracketing
boundary. For example, given the following
English bracketing result (8), (1,2), (1,3), (2,3),
(2,4) etc. are violate matches. We assign a
minimum $F_e(s, t)$ (0.0001 at present) to prevent
the structure match from being chosen when an
alternative match is available. Exact match
means the match falls exactly on the English
parsing boundary, and we assign a high $F_e(s, t)$
value (10 at present) to emphasize it; (1,6), (2,5)
and (3,5) are examples. All remaining spans are
inside matches; (3,4) and (4,5) are examples, and
the value 1 is assigned to $F_e(s, t)$ for them.
(8) [She/1 [is/2 [a/3 lovely/4 girl/5] ] ./6]    
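The three cases can be sketched as a small function. This is only an illustration under stated assumptions: spans are inclusive 1-based word positions as in example (8), "conflicts with the English bracketing boundary" is read as "partially overlaps some bracket without nesting", and the names `crosses`, `match_type`, `f_e` and `BRACKETS_8` are hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # inclusive word positions (s, t), 1-based as in example (8)

# Constraint values quoted in the text ("at present").
WEIGHTS = {"violate": 0.0001, "exact": 10.0, "inside": 1.0}


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap but neither contains the other."""
    return a[0] < b[0] <= a[1] < b[1] or b[0] < a[0] <= b[1] < a[1]


def match_type(span: Span, brackets: Iterable[Span]) -> str:
    """Classify a candidate span against the English brackets."""
    brackets = set(brackets)
    if span in brackets:
        return "exact"
    if any(crosses(span, b) for b in brackets):
        return "violate"
    return "inside"


def f_e(s: int, t: int, brackets: Iterable[Span]) -> float:
    """Constraint value F_e(s, t) for the span from word s to word t."""
    return WEIGHTS[match_type((s, t), brackets)]


# Brackets of example (8): [She/1 [is/2 [a/3 lovely/4 girl/5]] ./6]
BRACKETS_8 = [(1, 6), (2, 5), (3, 5)]
```

With these brackets, `match_type((2, 3), BRACKETS_8)` is a violate match because (2,3) cuts into the bracket (3,5), while (3,4) lies entirely inside it and is an inside match.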
Let the input English and Chinese sentences
be $e_1, \ldots, e_T$ and $c_1, \ldots, c_V$. As an abbreviation we
write $e_{s..t}$ for the sequence of words
$e_{s+1}, e_{s+2}, \ldots, e_t$, and similarly write $c_{u..v}$. The local
optimization function $\delta(s,t,u,v) = \max P[e_{s..t}/c_{u..v}]$
denotes the maximum probability of the
sub-parsing-tree of a node q such that both the
sub-strings $e_{s..t}$ and $c_{u..v}$ derive from node q.
Thus, the best parse has the probability
$\delta(0,T,0,V)$. $\delta(s,t,u,v)$ is calculated as
the maximum probability combination of all
possible sub-tree combinations (Wu 1995). To
insert English parsing constraints in bilingual
parsing, we integrate the constraint function
$F_e(s, t)$ into the local optimization function.
Computation of the local optimization function is
then modified as given below:

\[
\begin{aligned}
\delta(s,t,u,v) &= \max\left[\delta^{[\,]}(s,t,u,v),\ \delta^{\langle\rangle}(s,t,u,v)\right],\\
\delta^{[\,]}(s,t,u,v) &= \max_{\substack{s \le S \le t,\ u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} F_e(s,t)\,\delta(s,S,u,U)\,\delta(S,t,U,v),\\
\delta^{\langle\rangle}(s,t,u,v) &= \max_{\substack{s \le S \le t,\ u \le U \le v \\ (S-s)(t-S)+(U-u)(v-U) \ne 0}} F_e(s,t)\,\delta(s,S,U,v)\,\delta(S,t,u,U).
\end{aligned}
\]

    Initialization is as follows:

\[
\begin{aligned}
\delta(t-1,t,v-1,v) &= b(e_t/c_v), & 1 \le t \le T,\ 1 \le v \le V,\\
\delta(t-1,t,v,v) &= b(e_t/e), & 1 \le t \le T,\ 1 \le v \le V,\\
\delta(t,t,v-1,v) &= b(e/c_v), & 1 \le t \le T,\ 1 \le v \le V,
\end{aligned}
\]

where T and V are the lengths of the English and
Chinese sentences respectively, and $b(e_t/c_v)$ is the
probability of translating English word $e_t$ into
Chinese word $c_v$ (an unsubscripted e denotes the
null symbol). A minimal probability can be
assigned to the empty word alignments $b(e_t/e)$
and $b(e/c_v)$.
The optimal bilingual parsing tree for a
given sentence-pair can be computed using a
dynamic programming (DP) algorithm (Wu 1997).
Using the standard SBTG local optimization
function, the bilingual parsing result obtained for
the given sentence-pair (4) is shown as example
(5); when using the above modified local
optimization function, the parsing result is that
shown as example (7). Comparing the two results,
we can see that by integrating English parsing
constraints into BTG, the bilingual parsing
becomes more grammatical. Our experiments
showed that this English parsing supervised BTG
improves the accuracy of bilingual
bracketing by nearly 20% (Lü 2001b).
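The constrained dynamic programming search can be sketched as follows. This is a toy illustration, not the authors' implementation: it uses memoized recursion, fencepost spans (s, t) covering the words e[s:t], a hypothetical lexicon with probability 0.9 per pair, placeholder Chinese tokens c1 ... c7, and an assumed floor `b_unk` for word pairs missing from the lexicon. The function names are also assumptions.

```python
from functools import lru_cache

VIOLATE, EXACT, INSIDE = 1e-4, 10.0, 1.0  # F_e values quoted in the text


def constrained_sbtg_parse(e, c, b, brackets, b_null=1e-4, b_unk=1e-8):
    """Best constrained SBTG parse of the word lists e and c.

    b maps (english_word, chinese_word) to a translation probability;
    brackets holds the English constituents as fencepost spans (s, t).
    Returns (probability, tree); leaves are (english, chinese) pairs and
    nonterminals are (operator, english_span, left, right).
    """
    brackets = set(brackets)

    def f_e(s, t):
        if (s, t) in brackets:
            return EXACT
        if any(s < a < t < z or a < s < z < t for a, z in brackets):
            return VIOLATE
        return INSIDE

    @lru_cache(maxsize=None)
    def delta(s, t, u, v):
        # Lexical initialization: word pair, English word/e, e/Chinese word.
        if (t - s, v - u) == (1, 1):
            return b.get((e[s], c[u]), b_unk), (e[s], c[u])
        if (t - s, v - u) == (1, 0):
            return b_null, (e[s], None)
        if (t - s, v - u) == (0, 1):
            return b_null, (None, c[u])
        best = (-1.0, None)
        weight = f_e(s, t)
        for S in range(s, t + 1):
            for U in range(u, v + 1):
                if (S - s) * (t - S) + (U - u) * (v - U) == 0:
                    continue  # degenerate split excluded by the recurrence
                # Straight production [A A].
                (p1, t1), (p2, t2) = delta(s, S, u, U), delta(S, t, U, v)
                if weight * p1 * p2 > best[0]:
                    best = (weight * p1 * p2, ("[]", (s, t), t1, t2))
                # Inverted production <A A>.
                (p1, t1), (p2, t2) = delta(s, S, U, v), delta(S, t, u, U)
                if weight * p1 * p2 > best[0]:
                    best = (weight * p1 * p2, ("<>", (s, t), t1, t2))
        return best

    return delta(0, len(e), 0, len(c))


def composite_spans(tree):
    """English fencepost spans of all nonterminal nodes in a parse tree."""
    if len(tree) == 2:  # leaf: (english_word, chinese_word)
        return set()
    _, span, left, right = tree
    return {span} | composite_spans(left) | composite_spans(right)


# Sentence (4) with placeholder Chinese tokens and the brackets of (6).
english = "That old teacher is our adviser .".split()
chinese = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
lexicon = {(ew, cw): 0.9 for ew, cw in zip(english, chinese)}
english_brackets = {(0, 3), (4, 6), (3, 6), (0, 7)}
prob, tree = constrained_sbtg_parse(english, chinese, lexicon, english_brackets)
```

Because both sentences are in the same order, the lexicon alone cannot choose between bracketings; in this toy setting the factor $F_e$ is what makes the best tree contain the English constituents (0,3), (3,6), (4,6) and (0,7). The search is exhaustive over all splits, so the sketch is only practical for short sentences.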
The obtained bilingual parsing tree is in the
normal form of ITG, that is, each node in the tree
is either a lexical node or a binary-fanout
nonterminal node. We can combine subtrees to
restore the fanout flexibility using the associativity
of the productions, [[AA]A]=[A[AA]]=[AAA] and
<<AA>A>=<A<AA>>=<AAA>. The combining
operation must not cross the given English
parsing boundary.
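The combining step can be sketched as a recursive flattening. The tree encoding, the function name `flatten` and the reading of "not crossing the English boundary" as "never dissolve a node whose span is an English bracket" are assumptions made for illustration.

```python
def flatten(tree, brackets):
    """Merge same-operator children into their parent.

    A tree is either a word (str) or a tuple (operator, english_span, children),
    where operator is "[]" or "<>" and english_span is a fencepost pair.
    Children whose span is one of the given English brackets are kept intact.
    """
    if isinstance(tree, str):
        return tree
    op, span, children = tree
    merged = []
    for child in (flatten(ch, brackets) for ch in children):
        is_node = not isinstance(child, str)
        if is_node and child[0] == op and child[1] not in brackets:
            merged.extend(child[2])  # [[A A] A] = [A A A]; <<A A> A> = <A A A>
        else:
            merged.append(child)
    return (op, span, merged)


# Binary tree for "That old teacher is our adviser" with the brackets of (6).
binary = ("[]", (0, 6), [
    ("[]", (0, 3), [("[]", (0, 2), ["That", "old"]), "teacher"]),
    ("[]", (3, 6), ["is", ("[]", (4, 6), ["our", "adviser"])]),
])
flat = flatten(binary, {(0, 3), (3, 6), (4, 6)})
```

In this example the auxiliary node over "That old" is dissolved into its BNP parent, while the nodes that coincide with English constituents survive.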
3 Chinese bracketing knowledge extraction 
Table 1 shows some bilingual bracketing
examples obtained using the above method. For
ease of understanding, the tree form of the
first example is given in Figure 3(a). The leaf nodes are the
aligned words of the two languages and their
POS tag categories. These POS tags are
generated by an English and a Chinese POS
tagger respectively. The English POS tag and
phrase tag sets are the same as those of the Penn
Treebank (Marcus 1993); for the Chinese POS
tag set, please refer to the web site
http://mtlab.hit.edu.cn. The nonterminal nodes are
labeled using English sub-tree tags.
Based on the bilingual parsing result, it is 
easy to extract the Chinese bracketing structure 
according to the Inversion Transduction 
Grammar. For the normal node, the Chinese text 
is traversed in depth-first left to right order, but 
for an inverted node (indicated by a horizontal 
line in the parsing tree or indicated by a <> 
notation in bracketing expression), the right 
sub-tree is traversed before the left. Thus, the 
Chinese parsing tree corresponding to Figure 3(a) 
is shown in Figure 3(b). The nonterminal labels 
are derived from the English sub-tree. The 
extracted Chinese bracketing results from Table1  
Table 1  Bilingual bracketing examples
1. [<Mr.(NNP)/g1820g10995(nc) Chen(NNP)/g19484(nx)>_BNP [is(VBZ)/g7171(vx) <[the(ART)/e representative(NN)/g1207g15932(ng)]_BNP <of(IN)/g11352(usde) [our(PRP$)/g6117g1216(r) company(NN)/g1856g2508(ng)]_BNP>_PP>_NP]_VP .(.)/g452(wj)]_S
2. [Spring(NN)/g7161g3837(t) [is(VBZ)/g7171(vx) <[the(ART)/e first(JJ)/g12544g980(m) e/g1022(q) season(NN)/g4407g14422(ng)]_BNP <in(IN)/g18336(f) [a(ART)/g980(m) year(NN)/g5192(q)]_BNP>_PP>_X]_VP .(.)/g452(wj)]_S
3. [[The(ART)/e window(NN)/g12395g4388(ng)]_BNP [is(VBZ)/e <[e/g7368(d) narrower(JJR)/g10433g12376(a)] [than(IN)/g8616(p) [the(ART)/e door(NN)/g19388(ng)]_BNP]_PP>_ADJP]_VP .(.)/g452(wj)]_S
4. [<[The(ART)/e policeman(NN)/g16698g4531(ng)]_BNP [who(WP)/e [reported(VBD)/g6265g2590(vg) [the(ART)/g17837(r) e/g980(m) accident(NN)/g1119g6937(ng)]_BNP]_VP e/g11352(usde)]_SBAR>_NP [thinks(VBZ)/g16760g1038(vg) [it(PRP)/g18039(r) [was(VBD)/g7171(vx) [Tom(NNP)/g8760g3994(ny) 's(PRP$)/g11352(usde) fault(NN)/g19181(ng)]_BNP]_VP]_S]_VP .(.)/g452(wj)]_S
5. [[The(ART)/e Beijing(NNP)/g2283g1152(nd) zoo(NN)/g2172g10301g3265(ng)]_BNP [is(VBZ)/g7171(vx) <[the(ART)/e largest(JJS)/g7380g3835(a) e/g11352(usde) zoo(NN)/g2172g10301g3265(ng)]_BNP [I(PRP)/g6117(r) [e/g6164(ussu) have(VBP)/e ever(RB)/e visited(VBN)/g2454g16278(vg) e/g17819(ut) e/g11352(usde)]_VBP]_S>_NP]_VP .(.)/g452(wj)]_S
Table 2  The extracted Chinese bracketing results corresponding to Table 1
1. [[g19484/nx g1820g10995/nc]_BNP [g7171/vx [[[g6117g1216/r g1856g2508/ng]_BNP g11352/usde]_PP g1207g15932/ng]_NP]_VP g452/wj]_S
2. [g7161g3837/t [g7171/vx [[g980/m g5192/q]_BNP g18336/f]_PP [g12544g980/m g1022/q g4407g14422/ng]_BNP]_VP g452/wj]_S
3. [g12395g4388/ng [[g8616/p g19388/ng]_PP g7368/d g10433g12376/a]_VP g452/wj]_S
4. [[[[g6265g2590/vg [g17837/r g980/m g1119g6937/ng]_BNP]_VP g11352/usde]_SBAR g16698g4531/nc]_NP [g16760g1038/vg [g18039/r [g7171/vx [g8760g3994/ny g11352/usde g19181/ng]_BNP]_VP]_S]_VP g452/wj]_S
5. [[g2283g1152/nd g2172g10301g3265/ng]_BNP [g7171/vx [[g6117/r [g6164/ussu g2454g16278/vg g17819/ut g11352/usde]_VBP]_S [g7380g3835/a g11352/usde g2172g10301g3265/ng]_BNP]_NP]_VP g452/wj]_S
 
are listed in Table 2. 
  
 
 
 
 
 
 
 
 
Figure 3  Extracting the Chinese bracketing structure from bilingual parsing: (a) bilingual parsing result supervised by English parsing; (b) the Chinese parsing result extracted from (a). Both trees correspond to the first example in Tables 1 and 2.
  
   It can be seen from Table 2 that the automatically
acquired bracketing results reflect the Chinese
structure well, though some English phrase tags
are not suitable for labeling the corresponding
Chinese phrases directly. For example, in Table 2,
the English tags “PP” (prepositional phrase) in
sentence 1 and “SBAR” (clause) in sentence 4
incorrectly tag the corresponding Chinese
structures. We are not concerned with the phrase
tags here. Our main concern is the bracketing
boundary of the syntactic structure. Bracketing
boundary knowledge has been shown
to be valuable for Chinese grammar induction
(Zhou 1997). The advantage of our method is that
the bracketing knowledge is acquired from a
bilingual corpus automatically. It reduces the
manual labour of corpus tagging, which is
time-consuming and error-prone.
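The extraction step of this section can be sketched as a recursive traversal that visits the children of an inverted node from right to left. The tree encoding and the function name `chinese_bracketing` are assumptions for illustration; the Chinese tokens are the opaque glyph codes of the first example in Table 1 with POS tags left out, and dropping brackets around a single Chinese word is an assumption that agrees with the first row of Table 2.

```python
def chinese_bracketing(tree):
    """Return the Chinese bracketing string for a bilingual parse tree.

    A leaf is (english_word, chinese_word_or_None); a nonterminal is
    (operator, label, children) with operator "[]" or "<>".
    """
    if len(tree) == 2:  # leaf
        return tree[1] or ""
    op, label, children = tree
    # Inverted nodes are traversed right to left on the Chinese side.
    ordered = reversed(children) if op == "<>" else children
    parts = [p for p in map(chinese_bracketing, ordered) if p]
    if len(parts) <= 1:  # nothing to bracket around zero or one item
        return parts[0] if parts else ""
    return "[" + " ".join(parts) + "]_" + label


# First example of Table 1.
example_1 = ("[]", "S", [
    ("<>", "BNP", [("Mr.", "g1820g10995"), ("Chen", "g19484")]),
    ("[]", "VP", [
        ("is", "g7171"),
        ("<>", "NP", [
            ("[]", "BNP", [("the", None), ("representative", "g1207g15932")]),
            ("<>", "PP", [
                ("of", "g11352"),
                ("[]", "BNP", [("our", "g6117g1216"), ("company", "g1856g2508")]),
            ]),
        ]),
    ]),
    (".", "g452"),
])
extracted = chinese_bracketing(example_1)
```

The resulting string has the same brackets and labels as the first row of Table 2. The sketch does not try to reproduce the rows in which brackets are dropped or relabeled.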
4 Evaluation and discussion 
To evaluate the quality of the acquired Chinese
bracketing boundaries, we compared them with
parsing annotation based on an existing
Chinese syntax annotation scheme. Details of the
Chinese syntax annotation scheme and an
annotated corpus can be downloaded from the
website http://mtlab.hit.edu.cn.
The test set consisted of 3,000
English-Chinese bilingual sentence-pairs that
come from the machine translation evaluation
corpus (Duan 1996). The average length is 9.1
words for English sentences and 12.6 Chinese
characters for Chinese sentences. The test
sentence pairs were first aligned at the word level
based on statistics and a lexicon, with an accuracy of
nearly 90% (Lü 2001a). The English and Chinese
sentences were parsed based on the Penn
Treebank tag set and the Chinese syntax annotation
scheme respectively. Both the English and the
Chinese parsing results were manually corrected.
The corrected Chinese parsing results are used as
the standard test set.
We acquired Chinese bracketing results
using the proposed method. The previously defined
exact match, violate match and inside match are
used to evaluate the agreement between the
acquired bracketing results and the standard
parsing results. Here, exact match means the
acquired structure is the same as a standard
structure; violate match means the acquired
structure conflicts with the standard structure.
Otherwise, the acquired structure is called an
inside match. In example (9), A is the standard
bracketing result, B is the acquired bracketing
result and C shows the classification of the
acquired structures. The bracket covering the whole
sentence does not take part in the evaluation. Exact
match rate (EMR), violate match rate (VMR) and
inside match rate (IMR) denote the proportions of
the three types of brackets among all acquired
brackets.
(9) A. [g11345g14406 g451 g13430g14406 g2656 g15025g14406]_NP g7171 [[[g5468g3822 g3911g4413]_BNP [g6164 g2928g8438]_VSUO]_SS g11352 [[g989 g12193]_BMP g20080g14406]_BNP]_NP g452
(Word gloss: white, red and blue; are; many girls; like; three; colors)
(In English: White, red and blue are the three colors
which many girls like.)
B. [[g11345g14406 g451 g13430g14406 g2656 g15025g14406]_BNP [g7171 [[[[g5468g3822 g3911g4413 g6164]_BNP g2928g8438]_S g11352]_SBAR [g989 g12193 g20080g14406]_BNP]_NP]_VP g452]_S
C. a) exact match: [g11345g14406g451g13430g14406g2656g15025g14406]; [g989g12193g20080g14406]; [g5468g3822g3911g4413g6164g2928g8438]; [g5468g3822g3911g4413g6164g2928g8438g11352g989g12193g20080g14406]
b) violate match: [g5468g3822g3911g4413g6164]
c) inside match: [g5468g3822g3911g4413g6164g2928g8438g11352]; [g7171g5468g3822g3911g4413g6164g2928g8438g11352g989g12193g20080g14406]
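The classification in example (9) can be reproduced with a short sketch. It assumes that every bracket is written as an inclusive span of word positions (the fifteen tokens of the sentence are numbered 1 to 15) and that a violate match is an acquired bracket that partially overlaps some standard bracket; the function names are hypothetical.

```python
from collections import Counter


def crosses(a, b):
    """True if two inclusive spans overlap but neither contains the other."""
    return a[0] < b[0] <= a[1] < b[1] or b[0] < a[0] <= b[1] < a[1]


def classify(span, standard):
    """Label an acquired bracket as an exact, violate or inside match."""
    if span in standard:
        return "exact"
    if any(crosses(span, g) for g in standard):
        return "violate"
    return "inside"


def match_rates(acquired, standard, sentence_span):
    """Return (EMR, IMR, VMR), ignoring the bracket over the whole sentence."""
    standard = set(standard)
    counts = Counter(classify(s, standard) for s in acquired if s != sentence_span)
    total = sum(counts.values())
    return tuple(counts[k] / total for k in ("exact", "inside", "violate"))


# Example (9): standard brackets A and acquired brackets B as word-position spans.
standard_a = [(1, 5), (7, 8), (9, 10), (7, 10), (12, 13), (12, 14), (7, 14)]
acquired_b = [(1, 5), (7, 9), (7, 10), (7, 11), (12, 14), (7, 14), (6, 14), (1, 15)]
emr, imr, vmr = match_rates(acquired_b, standard_a, sentence_span=(1, 15))
```

For this single sentence the seven evaluated brackets split into four exact matches, two inside matches and one violate match, as listed under C.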
Table 3 gives the evaluation results. The
evaluation results for the acquired Chinese structures
corresponding to six main English phrase types (BNP,
NP, VP, ADJP, ADVP and PP) are also given in
detail.
From the results we can see that only a small
fraction of the learned structures are violate
matches (14.03%), while most of them are exact
matches (55.46%). In addition, there are also many
inside matches. These inside matches occur because
the Penn Treebank and the standard Chinese
annotation scheme differ in how they merge
phrases. English phrase structures
are labeled in more detail, whereas in Chinese
the main phrases at the sentence level are not
merged further. For example, the verb and its object
at the sentence level are not combined. That is why
most of the verb phrases (VP) are inside matches
(53.28%). The bracketing boundary of an inside
match can be either right or wrong. We checked
the correctness of the inside matches manually and
obtained an average accuracy of 79.37%. The accuracy
of all acquired structure bracketing is therefore 79.68%
(EMR + IMR × Accuracy of IM).
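As a check of the arithmetic, the overall accuracy and the VP accuracy can be recomputed from the EMR, IMR and inside-match accuracy figures reported in Table 3:

```latex
\[
\begin{aligned}
\text{Accuracy}_{\text{all}} &= \text{EMR} + \text{IMR} \times \text{Accuracy of IM}
 = 55.46\% + 30.51\% \times 0.7937 \approx 55.46\% + 24.22\% = 79.68\%,\\
\text{Accuracy}_{\text{VP}} &= 28.46\% + 53.28\% \times 0.9272 \approx 28.46\% + 49.40\% = 77.86\%.
\end{aligned}
\]
```

Violate matches count as errors, so they contribute nothing to either sum.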
The violate matches acquired in bilingual
parsing are mainly due to empty word
alignments, for example in the special structures
“g6238 ...” and “g15999 ...” in Chinese. The words
“g6238” and “g15999” have no counterpart word in
English. They are usually merged with the
neighboring noun, as shown in example
(10), which leads to a violate match. It is necessary
to build special patterns to handle these structures.
Word alignment errors also produce violate
matches in bilingual bracketing.
(10) [[g6238/p g1194g11352/r g11023/ng] [g4008g6188/vg [g13485/vg [g6117g1216/r g10043g20050/vg]]] g452/wj]
The Chinese bracketing accuracy obtained
using our method is comparable to that of the
example-based Chinese parser (77%) (Streiter
2000), but it is lower than that of the PCFG
Chinese parser (84%) (Yao 1998). However,
unlike these two parsers, our method does not need any
Table 3  Evaluation on acquired Chinese bracketing results
Type   Bracket number   EMR      IMR      VMR      Accuracy of IM   Accuracy
all    10119            55.46%   30.51%   14.03%   79.37%           79.68%
BNP    2533             71.69%   18.95%   9.36%    49.58%           81.09%
NP     675              65.63%   20.30%   14.07%   70.80%           80.00%
VP     1676             28.46%   53.28%   18.26%   92.72%           77.86%
ADJP   192              60.94%   26.04%   13.02%   86.00%           83.33%
ADVP   120              45.83%   38.33%   15.83%   76.09%           74.99%
PP     1198             52.92%   27.46%   19.62%   82.67%           75.62%
Chinese annotated training corpus, which is
difficult to accumulate. Another advantage of our
method is that the Chinese bracketing result is
derived from English parsing and a parallel
corpus, which makes it particularly beneficial for
research on the correspondence
between Chinese and English phrases. In (Lü
2001b), we used the bilingual bracketing results for
automatic translation template acquisition,
which turned out to be very useful for structure
transfer in machine translation. In addition, the
acquired bracketing corpus can be applied to
many Chinese NLP tasks. It can be used as the
foundation for further Chinese treebank
annotation, which will save a great deal of
human labour. It can also be used to improve the
efficiency and accuracy of Chinese grammar
induction (Zhou 1997). Grammar rules can also
be extracted from the bracketing corpus. For
example, we can obtain the following BNP rules
from the acquired bracketing results in Table 2:
BNP->nx+nc;    BNP->r+ng;     BNP->m+q;       
BNP->m+q+ng;  BNP->r+m+ng;  BNP->ny+usde+ng;  
BNP->nd+ng;    BNP->a+usde+ng; 
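Rule extraction of this kind can be sketched with a small stack-based scan. The input notation is an assumption: each closing bracket carries its label as `]_LABEL` and each word is written `word/POS`; only phrases whose children are all words yield a rule, which is how the BNP rules above are read here. The function name `extract_rules` is hypothetical.

```python
import re

# "[" | "]_LABEL" | a word/POS token
TOKEN = re.compile(r"\[|\]_(\w+)|([^\s\[\]]+)")


def extract_rules(bracketed, label="BNP"):
    """Collect rules such as 'BNP->nx+nc' from a labeled bracketing string."""
    rules, stack = set(), []
    for m in TOKEN.finditer(bracketed):
        if m.group(0) == "[":
            stack.append([])
        elif m.group(1):  # closing bracket together with its phrase label
            children = stack.pop()
            if m.group(1) == label and all(ch is not None for ch in children):
                rules.add(label + "->" + "+".join(children))
            if stack:
                stack[-1].append(None)  # a nested phrase, not a POS-tagged word
        else:  # word/POS token: keep only the POS tag
            stack[-1].append(m.group(2).rsplit("/", 1)[1])
    return rules


# Rows 1 and 2 of Table 2 (Chinese words appear as opaque glyph codes).
row_1 = ("[[g19484/nx g1820g10995/nc]_BNP [g7171/vx [[[g6117g1216/r g1856g2508/ng]_BNP "
         "g11352/usde]_PP g1207g15932/ng]_NP]_VP g452/wj]_S")
row_2 = ("[g7161g3837/t [g7171/vx [[g980/m g5192/q]_BNP g18336/f]_PP "
         "[g12544g980/m g1022/q g4407g14422/ng]_BNP]_VP g452/wj]_S")
bnp_rules = extract_rules(row_1) | extract_rules(row_2)
```

Applied to these two rows, the sketch yields the first four rules of the list above; counting how often each rule occurs over a whole corpus would be a natural extension.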
Conclusion 
In this paper, we have presented a method to
learn Chinese syntactic structure from English
parsing based on a bilingual language model. The
method creates structurally bracketed Chinese
corpora automatically by taking full advantage of
English parsing and bilingual corpora. The
created corpora are useful for further
Chinese corpus annotation and parsing
knowledge acquisition. Preliminary experiments
demonstrated the feasibility and validity of the method.
Although this paper deals with Chinese and
English, the method is also applicable to other
language pairs. We expect it to be more effective
when the two languages come from the same
language family, such as English and French.
Acknowledgements 
This research was funded by High Technology 
Research and Development Program of China 
(2001AA114101). We also would like to thank 
the Institute of Computational Linguistics at 
Peking University for providing bilingual 
corpora for test.  

References  

Dekai Wu (1995) An algorithm for simultaneously 
bracketing parallel texts by aligning words. 
Proceedings of the 33rd Annual Meeting of the 
Association for Computational Linguistics, pp. 
244-251.  

Dekai Wu (1997). Stochastic inversion transduction 
grammars and bilingual parsing of parallel corpora. 
Computational Linguistics, 23(3), pp. 377-403 

E. Brill (1993) Transformation-based error driven 
parsing. Proceedings of International Workshop on 
Parsing Technologies. 

Huiming Duan and Shiwen Yu (1996). Report for 
machine translation evaluation. Computer World, 
1996.3:183 (in Chinese) 

I. Dan Melamed (2000). Models of Translational 
Equivalence among words. Computational 
Linguistics 26(2), pp. 221-249 

Keh-Jiann Chen (1996). A model for robust Chinese 
parser. Computational Linguistics and Chinese 
Language Processing. 1(1), pp.183-204 

Marcus M. P., Marcinkiewicz M. A. and Santorini B 
(1993). Building a large annotated corpus of 
English: the Penn Treebank. Computational 
Linguistics, 19(2), pp. 313-330 

Michael Collins (1997).Three generative, lexicalised 
models for statistical parsing. Proceedings of the 
35th Annual Meeting of the ACL, Madrid   

Qiang Zhou and Changning Huang (1997). A Chinese
syntactic parser based on bracket matching principle.
Communication of COLIPS, 7(2), pp. 53-59

Streiter O. and Chen K.J. (2000). Experiments in 
example-based parsing. In Dialogue 2000, 
International Seminar in Computational Linguistics 
and Applications, Tarusa, Russia. 

Yajuan Lü, Tiejun Zhao, Sheng Li and Muyun Yang. 
(2001a) English-Chinese word alignment based on 
statistic and lexicon. Proceedings of 6th Joint 
Symposium of Computational Linguistics, TaiYuan, 
China, pp. 108-115. (in Chinese) 

Yajuan Lü, Ming Zhou, Sheng Li, Changning Huang, 
Tiejun Zhao (2001b). Automatic translation 
template acquisition based on bilingual structure 
alignment.  International Journal of Computational 
Linguistics and Chinese Language Processing. 6(1),  
pp. 1-26. 

Yuan Yao and Kim Teng Lua. (1998).  A Probabilistic 
Context-Free Grammar Parser for Chinese. 
Computer Processing of Oriental Language,11(4), 
pp.393-407 
