Two Statistical Parsing Models Applied to the Chinese Treebank 
Daniel M. Bikel David Chiang 
Department of Computer & Information Science 
University of Pennsylvania 
200 South 33rd Street 
Philadelphia, PA 19104-6389 
{dbikel,dchiang}@cis.upenn.edu
Abstract 
This paper presents the first-ever 
results of applying statistical pars- 
ing models to the newly-available 
Chinese Treebank. We have em- 
ployed two models, one extracted 
and adapted from BBN's SIFT Sys- 
tem (Miller et al., 1998) and a TAG- 
based parsing model, adapted from 
(Chiang, 2000). On sentences with
≤40 words, the former model performs
at 69% precision, 75% recall,
and the latter at 77% precision and 
78% recall. 
1 Introduction 
Ever since the success of HMMs' applica- 
tion to part-of-speech tagging in (Church, 
1988), machine learning approaches to natural
language processing have steadily become more
widespread. This increase has of
course been due to their proven efficacy in
many tasks, but also to their engineering
efficacy. Many machine learning approaches let
the data speak for itself (data ipsa loquun- 
tur), as it were, allowing the modeler to focus 
on what features of the data are important, 
rather than on the complicated interaction 
of such features, as had often been the case 
with hand-crafted NLP systems. The success 
of statistical methods in particular has been 
quite evident in the area of syntactic parsing,
most recently with the outstanding results
of (Charniak, 2000) and (Collins, 2000)
on the now-standard English test set of the 
Penn Treebank (Marcus et al., 1993). A sig- 
nificant trend in parsing models has been the 
incorporation of linguistically-motivated features;
however, it is important to note that
"linguistically-motivated" does not necessarily
mean "language-dependent"; often, it means
just the opposite. For example, almost all statistical
parsers make use of lexicalized nonterminals
in some way, which allows lexical
items' idiosyncratic parsing preferences to
be modeled, but the pairing between head
words and their parent nonterminals is determined
almost entirely by the training data,
thereby making this feature (which models
preferences of particular words of a particular
language) almost entirely language-independent.
In this paper, we will explore the
use of two parsing models, which were originally
designed for English parsing, on parsing
Chinese, using the newly-available Chinese
Treebank. We will show that the language-dependent
components of these parsers are
quite compact, and that with little effort they 
can be adapted to produce promising results 
for Chinese parsing. We also discuss directions 
for future work. 
2 Models and Modifications 
We will briefly describe the two parsing mod- 
els employed (for a full description of the BBN 
model, see (Miller et al., 1998) and also (Bikel, 
2000); for a full description of the TAG model,
see (Chiang, 2000)). 
2.1 Model 2 of (Collins, 1997) 
Both parsing models discussed in this paper 
inherit a great deal from this model, so we 
briefly describe its "progenitive" features here,
describing only how each of the two models of
this paper differs in the subsequent two sections.
(S(will-MD)
  (NP(Apple-NNP) (NNP Apple))
  (VP(will-MD)
    (MD will)
    (VP(buy-VB)
      (VB buy)
      (PRT(out-RP) (RP out))
      (NP(Microsoft-NNP) (NNP Microsoft)))))
Figure 1: A sample sentence with parse tree.

The lexicalized PCFG that sits behind
Model 2 of (Collins, 1997) has rules of the
form
$P \rightarrow L_n L_{n-1} \cdots L_1 H R_1 \cdots R_{n-1} R_n$ (1)
where P, $L_i$, $R_i$ and H are all lexicalized
nonterminals, and P inherits its lexical head 
from its distinguished head child, H. In this 
generative model, first P is generated, then 
its head-child H, then each of the left- and 
right-modifying nonterminals are generated 
from the head outward. The modifying nonterminals
$L_i$ and $R_i$ are generated conditioning
on P and H, as well as a distance metric
(based on what material intervenes between
the currently-generated modifying nonterminal
and H) and an incremental subcat
frame feature (a multiset containing the complements
of H that have yet to be generated
on the side of H in which the currently-generated
nonterminal falls). Note that if the
modifying nonterminals were generated com- 
pletely independently, the model would be 
very impoverished, but in actuality, by includ- 
ing the distance and subcat frame features, 
the model captures a crucial bit of linguis- 
tic reality, viz., that words often have well- 
defined sets of complements and adjuncts, dis- 
persed with some well-defined distribution in 
the right hand sides of a (context-free) rewrit- 
ing system. 
2.2 BBN Model 
2.2.1 Overview 
The BBN model is also of the lexicalized 
PCFG variety. In the BBN model, as with
Model 2 of (Collins, 1997), modifying nonterminals
are generated conditioning both on
the parent P and its head child H. Unlike
Model 2 of (Collins, 1997), they are also generated
conditioning on the previously generated
modifying nonterminal, $L_{i-1}$ or $R_{i-1}$,
and there is no subcat frame or distance feature.
While the BBN model does not perform
at the level of Model 2 of (Collins, 1997)
on Wall Street Journal text, it is also less
language-dependent, eschewing the distance
metric (which relied on specific features of the
English Treebank) in favor of the "bigrams on
nonterminals" model.
2.2.2 Model Parameters 
This section briefly describes the top-level 
parameters used in the BBN parsing model. 
We use p to denote the unlexicalized nonterminal
corresponding to P in (1), and similarly
for $l_i$, $r_i$ and h. We now present the top-level
generation probabilities, along with examples
from Figure 1. For brevity, we omit the
smoothing details of BBN's model (see (Miller 
et al., 1998) for a complete description); we 
note that all smoothing weights are computed 
via the technique described in (Bikel et al., 
1997). 
The probability of generating p as the
root label is predicted conditioning on only
+TOP+, which is the hidden root of all parse
trees:
$P(p \mid \text{+TOP+})$, e.g., $P(\mathrm{S} \mid \text{+TOP+})$. (2)
The probability of generating a head node h
with a parent p is
$P(h \mid p)$, e.g., $P(\mathrm{VP} \mid \mathrm{S})$. (3)
The probability of generating a left-modifier
$l_i$ is
$P_L(l_i \mid l_{i-1}, p, h, w_h)$, e.g., $P_L(\mathrm{NP} \mid \text{+BEGIN+}, \mathrm{S}, \mathrm{VP}, \textit{will})$ (4)
when generating the NP for NP(Apple-NNP),
and the probability of generating a right modifier
$r_i$ is
$P_R(r_i \mid r_{i-1}, p, h, w_h)$, e.g., $P_R(\mathrm{NP} \mid \mathrm{PRT}, \mathrm{VP}, \mathrm{VB}, \textit{buy})$ (5)
when generating the NP for NP(Microsoft-NNP).¹
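The Markov process over modifying nonterminals in (4) and (5) can be illustrated with a minimal sketch. It assumes a plain lookup table of probabilities and omits smoothing; the function names `modifier_events` and `log_prob` and the numbers in `toy_table` are invented for illustration and are not taken from the BBN implementation.

```python
import math

BEGIN, END = "+BEGIN+", "+END+"

def modifier_events(modifiers, parent, head, head_word):
    """List the (outcome, context) events for one side of the head.

    `modifiers` is ordered from the head outward; each label is predicted
    from the previously generated label, the parent, the head and the
    head word, and the sequence is closed off with +END+.
    """
    events, previous = [], BEGIN
    for label in list(modifiers) + [END]:
        events.append((label, (previous, parent, head, head_word)))
        previous = label
    return events

def log_prob(events, table):
    """Sum the log probabilities of the events under a lookup table."""
    return sum(math.log(table[event]) for event in events)

# Right modifiers of the head VB inside VP(buy-VB) from Figure 1.
right_events = modifier_events(["PRT", "NP"], "VP", "VB", "buy")

# Invented probabilities, for illustration only.
toy_table = {
    ("PRT", (BEGIN, "VP", "VB", "buy")): 0.5,
    ("NP", ("PRT", "VP", "VB", "buy")): 0.4,
    (END, ("NP", "VP", "VB", "buy")): 0.9,
}
right_log_prob = log_prob(right_events, toy_table)
```

The second event in `right_events` corresponds to the example given for (5): NP is predicted from the previously generated PRT, the parent VP, the head VB and the head word "buy".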
The probabilities for generating lexical elements
(part-of-speech tags and words) are as
follows. The part of speech tag of the head of
the entire sentence, $t_h$, is computed conditioning
only on the top-most symbol p:²
$P(t_h \mid p)$. (6)
Part of speech tags of modifier constituents,
$t_{l_i}$ and $t_{r_i}$, are predicted conditioning on the
modifier constituent $l_i$ or $r_i$, the tag of the
head constituent, $t_h$, and the word of the head
constituent, $w_h$:
$P(t_{l_i} \mid l_i, t_h, w_h)$ and $P(t_{r_i} \mid r_i, t_h, w_h)$. (7)
The head word of the entire sentence, $w_h$, is
predicted conditioning only on the top-most
symbol p and $t_h$:
$P(w_h \mid t_h, p)$. (8)
Head words of modifier constituents, $w_{l_i}$ and
$w_{r_i}$, are predicted conditioning on all the context
used for predicting parts of speech in (7),
as well as the parts of speech themselves:
$P(w_{l_i} \mid t_{l_i}, l_i, t_h, w_h)$ and $P(w_{r_i} \mid t_{r_i}, r_i, t_h, w_h)$. (9)
The original English model also included a
word feature to help reduce part-of-speech
ambiguity for unknown words, but this component
of the model was removed for Chinese,
as it was language-dependent.
¹ The hidden nonterminal +BEGIN+ is used to
provide a convenient mechanism for determining the
initial probability of the underlying Markov process
generating the modifying nonterminals; the hidden
nonterminal +END+ is used to provide consistency to
the underlying Markov process, i.e., so that the
probabilities of all possible nonterminal sequences sum to 1.
² This is the one place where we altered the original
model, as the lexical components of the head of the
entire sentence were all being estimated incorrectly,
causing an inconsistency in the model. We corrected
the estimation of $t_h$ and $w_h$ in our implementation.

The probability of an entire parse tree is
the product of the probabilities of generating
all of the elements of that parse tree,
where an element is either a constituent label,
a part of speech tag or a word. We obtain
maximum-likelihood estimates of the parameters
of this model using frequencies gathered
from the training data.
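As a rough illustration of estimating parameters from frequencies, the sketch below computes relative-frequency (maximum-likelihood) estimates from a list of observed events. The function name `mle` and the event list are hypothetical, and the smoothing used in the actual model is deliberately left out.

```python
from collections import Counter

def mle(events):
    """Relative-frequency estimates of P(outcome | context).

    `events` is an iterable of (outcome, context) pairs collected from
    training trees; the estimate is count(outcome, context) / count(context).
    """
    events = list(events)
    joint = Counter(events)
    context = Counter(ctx for _, ctx in events)
    return {(out, ctx): n / context[ctx] for (out, ctx), n in joint.items()}

# Hypothetical head-generation events for P(h | p), as in (3).
observed = [("VP", "S"), ("VP", "S"), ("NP", "S"), ("VB", "VP")]
head_probs = mle(observed)
```

With these invented counts, the head VP is observed under parent S in two of three cases, so its estimate is 2/3. The probability of a whole tree would then be the product of such estimates over all of its elements.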
2.3 TAG Model 
The model of (Chiang, 2000) is based
on stochastic TAG (Resnik, 1992; Schabes,
1992). In this model a parse tree is built up not
out of lexicalized phrase-structure rules but out of
tree fragments (called elementary trees) which
are lexicalized in the sense that each fragment
contains exactly one lexical item (its anchor).
In the variant of TAG we use, there are
three kinds of elementary tree: initial, (predicative)
auxiliary, and modifier, and three
composition operations: substitution, adjunction,
and sister-adjunction. Figure 2 illustrates
all three of these operations. $\alpha_1$ is an
initial tree which substitutes at the leftmost
node labeled NP$\downarrow$; $\beta$ is an auxiliary tree which
adjoins at the node labeled VP. See (Joshi and
Schabes, 1997) for a more detailed explanation.
Sister-adjunction is not a standard TAG operation,
but is borrowed from D-Tree Grammar
(Rambow et al., 1995). In Figure 2 the modifier
tree $\gamma$ is sister-adjoined between the nodes
labeled VB and NP$\downarrow$. Multiple modifier trees
can adjoin at the same place, in the spirit of
(Schabes and Shieber, 1994).
In stochastic TAG, the probability of generating
an elementary tree depends on the elementary
tree itself and the elementary tree
it attaches to. The parameters are as follows:
$\sum_{\alpha} P_i(\alpha) = 1$
$\sum_{\alpha} P_s(\alpha \mid \eta) = 1$
$\sum_{\beta} P_a(\beta \mid \eta) + P_a(\mathrm{NONE} \mid \eta) = 1$
where $\alpha$ ranges over initial trees, $\beta$ over auxiliary
trees, $\gamma$ over modifier trees, and $\eta$ over
nodes. $P_i(\alpha)$ is the probability of beginning
a derivation with $\alpha$; $P_s(\alpha \mid \eta)$ is the probability
of substituting $\alpha$ at $\eta$; $P_a(\beta \mid \eta)$ is
the probability of adjoining $\beta$ at $\eta$; finally,
$P_a(\mathrm{NONE} \mid \eta)$ is the probability of nothing
adjoining at $\eta$.
[Figure 2 shows the elementary trees ($\alpha_1$), ($\beta$), ($\alpha_2$), ($\gamma$) and ($\alpha_3$), together with the derivation tree and the derived tree.]
Figure 2: Grammar and derivation for "Apple will buy out Microsoft."

Our variant adds another set of parameters:
$\sum_{\gamma} P_{sa}(\gamma \mid \eta, i, f) + P_{sa}(\mathrm{STOP} \mid \eta, i, f) = 1$
This is the probability of sister-adjoining $\gamma$
between the $i$th and $(i+1)$th children of $\eta$
(allowing for two imaginary children beyond the
leftmost and rightmost children). Since multiple
modifier trees can adjoin at the same location,
$P_{sa}(\gamma)$ is also conditioned on a flag f
which indicates whether $\gamma$ is the first modifier
tree (i.e., the one closest to the head) to
adjoin at that location.
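One way to read the role of the flag f is sketched below: the modifier trees adjoined at a single location are scored one by one, with f true only for the first event, and the sequence is closed by STOP. This is a sketch under assumptions, not the implementation of (Chiang, 2000): the function name, the table layout, the treatment of STOP as carrying the same flag, and all probabilities are invented for illustration.

```python
def sister_adjunction_prob(modifiers, node, position, p_sa):
    """Probability of a sequence of modifier trees at one location.

    `modifiers` is ordered from the head outward. `p_sa` maps
    (tree, node, position, is_first) to a probability, where `is_first`
    plays the role of the flag f. The sequence is terminated by STOP.
    """
    prob, is_first = 1.0, True
    for tree in list(modifiers) + ["STOP"]:
        prob *= p_sa[(tree, node, position, is_first)]
        is_first = False
    return prob

# Invented parameters for the location between the 1st and 2nd children
# of a VP node (between VB and the NP substitution node in Figure 2).
toy_p_sa = {
    ("gamma", "VP", 1, True): 0.2,
    ("STOP", "VP", 1, True): 0.8,
    ("gamma", "VP", 1, False): 0.1,
    ("STOP", "VP", 1, False): 0.9,
}

one_modifier = sister_adjunction_prob(["gamma"], "VP", 1, toy_p_sa)
no_modifier = sister_adjunction_prob([], "VP", 1, toy_p_sa)
```

In the toy table, the entries sharing a context sum to 1, mirroring the normalization constraint on $P_{sa}$.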
For our model we break down these prob- 
abilities further: first the elementary tree is 
generated without its anchor, and then its an- 
chor is generated. See (Chiang, 2000) for more 
details. 
During training each example is broken 
into elementary trees using head rules and 
argument/adjunct rules similar to those of 
(Collins, 1997). The rules are interpreted as 
follows: a head is kept in the same elementary
tree as its parent, an argument is broken
off into a separate initial tree, leaving a sub- 
stitution node, and an adjunct is broken off 
into a separate modifier tree. A different rule 
is used for extracting auxiliary trees; see (Chi- 
ang, 2000) for details. Xia (1999) describes a 
similar process, and in fact our rules for the 
Xinhua corpus are based on hers. 
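The decomposition into elementary trees can be sketched as follows, assuming each child has already been marked as head, argument or adjunct by the head rules and argument/adjunct rules. The tuple encoding, the role names and the function `extract` are hypothetical, and the separate rule for auxiliary trees is not covered, so the example sentence omits "will".

```python
def extract(node, kind="initial"):
    """Split a role-annotated tree into elementary trees.

    A node is (label, role, children); a child that is a plain string is
    a word. Heads stay in their parent's tree, arguments are cut off as
    initial trees (leaving a substitution node), and adjuncts are cut
    off as modifier trees.
    """
    fragments = []

    def build(n):
        label, _, children = n
        kept = []
        for child in children:
            if isinstance(child, str):       # word under its preterminal
                kept.append(child)
                continue
            child_label, role, _ = child
            if role == "head":
                kept.append(build(child))
            elif role == "arg":
                kept.append(("SUBST", child_label))
                fragments.extend(extract(child, "initial"))
            else:                            # adjunct
                fragments.extend(extract(child, "modifier"))
        return (label, kept)

    tree = build(node)
    return [(kind, tree)] + fragments

# "Apple buy out Microsoft", with roles assigned by hand for illustration.
annotated = ("S", None, [
    ("NP", "arg", [("NNP", "head", ["Apple"])]),
    ("VP", "head", [
        ("VB", "head", ["buy"]),
        ("PRT", "adj", [("RP", "head", ["out"])]),
        ("NP", "arg", [("NNP", "head", ["Microsoft"])]),
    ]),
])
elementary_trees = extract(annotated)
```

On this input the sketch yields one initial tree anchored by "buy" with two substitution nodes, two initial trees for the NPs, and one modifier tree for the particle.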
2.4 Modifications 
The primary language-dependent component
that had to be changed in both models was 
the head table, used to determine heads when 
training. We modified the head rules described 
in (Xia, 1999) for the Xinhua corpus and sub- 
stituted these new rules into both models. 
The (Chiang, 2000) model had the following 
additional modifications. 
• The new corpus had to be prepared for 
use with the trainer and parser. Aside 
from technicalities, this involved retrain- 
ing the part-of-speech tagger described in 
(Ratnaparkhi, 1997), which was used for 
tagging unknown words. We also lowered 
the unknown word threshold from 4 to 2 
because the Xinhua corpus was smaller 
than the WSJ corpus. 
• In addition to the change to the head- 
finding rules, we also changed the rules 
for classifying modifiers as arguments or 
adjuncts. In both cases the new rules were 
adapted from (Xia, 1999). 
• For the tests done in this paper, a beam
width of $10^{-4}$ was used.
The BBN model had the following additional 
modifications: 
• As with the (Chiang, 2000) model, we
lowered the unknown word threshold of the
BBN model, from its default of 5 to 2.
• The language-dependent word-feature
was eliminated, causing parts of speech
for unknown words to be predicted solely
from the head relations in the model.
• The default beam size in the probabilistic
CKY parsing algorithm was widened.
The default beam pruned away chart entries
whose scores were not within a factor
of $e^{-5}$ of the top-ranked subtree; this
tight limit was changed to $e^{-9}$. Also, the
default decoder pruned away all but the
top 25-ranked chart entries in each cell;
this limit was expanded to 50.

≤40 words
Model, test set             LR    LP    CB    0CB    ≤2CB
BBN-all†, WSJ-all           84.7  86.5  1.12  60.6   83.2
BBN-small†, WSJ-small*      79.0  80.7  1.66  47.0   74.6
BBN, Xinhua‡                69.0  74.8  2.05  45.0   68.5
Chiang-all, WSJ-all         86.9  86.6  1.09  63.2   84.3
Chiang-small, WSJ-small     78.9  79.6  1.75  44.8   72.4
Chiang, Xinhua              76.8  77.8  1.99  50.8   74.1

≤100 words
Model, test set             LR    LP    CB    0CB    ≤2CB
BBN-all†, WSJ-all           83.9  85.7  1.31  57.8   80.8
BBN-small†, WSJ-small*      78.4  80.0  1.92  44.3   71.3
BBN, Xinhua‡                67.5  73.5  2.87  39.9   61.8
Chiang-all, WSJ-all         86.2  85.8  1.29  60.4   81.8
Chiang-small, WSJ-small     77.1  78.8  2.00  43.25  70.5
Chiang, Xinhua              73.3  74.6  3.03  44.8   66.8

Table 1: Results for both parsing models on all test sets. Key: LR = labeled recall, LP = labeled
precision, CB = avg. crossing brackets, 0CB = zero crossing brackets, ≤2CB = ≤2 crossing
brackets. All results are percentages, except for those in the CB column. † Used larger beam
settings and lower unknown word threshold than the defaults. * 3 of the 400 sentences were not
parsed due to timeouts and/or pruning problems. ‡ 3 of the 348 sentences did not get parsed due
to pruning problems, and 2 other sentences had length mismatches (scoring program errors).
3 Experiments and Results 
The Chinese Treebank consists of 4185 sen- 
tences of Xinhua newswire text. We blindly 
separated this into training, devtest and test 
sets, with a roughly 80/10/10 split, putting 
files 001-270 (3484 sentences, 84,873 words) 
into the training set, 301-325 (353 sentences, 
6776 words) into the development test set and 
reserving 271-300 (348 sentences, 7980 words) 
for testing. See Table 1 for results. 
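The labeled recall, labeled precision and crossing-bracket measures reported in Table 1 can be computed along the following lines. This is a simplified sketch: the scoring program used for the experiments is not specified here, constituents are assumed to be (label, start, end) tuples, and conventions such as punctuation handling are ignored. The function names and the example brackets are invented.

```python
from collections import Counter

def labeled_precision_recall(gold, test):
    """Labeled precision and recall over (label, start, end) brackets."""
    gold_counts, test_counts = Counter(gold), Counter(test)
    matched = sum((gold_counts & test_counts).values())
    return matched / sum(test_counts.values()), matched / sum(gold_counts.values())

def crossing_brackets(gold, test):
    """Number of proposed brackets that cross some gold bracket."""
    def crosses(a, b):
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]
    gold_spans = {(s, e) for _, s, e in gold}
    return sum(any(crosses((s, e), g) for g in gold_spans) for _, s, e in test)

# Gold brackets for the tree in Figure 1 and an invented parser output.
gold = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5),
        ("VP", 2, 5), ("PRT", 3, 4), ("NP", 4, 5)]
proposed = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 3), ("NP", 4, 5)]

precision, recall = labeled_precision_recall(gold, proposed)
crossing = crossing_brackets(gold, proposed)
```

In this toy case three of the four proposed brackets match gold brackets, three of the six gold brackets are recovered, and the proposed VP over positions 1 to 3 crosses the gold VP over positions 2 to 5.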
In order to put the new Chinese Treebank 
results into context with the unmodified (En- 
glish) parsing models, we present results on 
two test sets from the Wall Street Journal: 
WSJ-all, which is the complete Section 23 (the 
de facto standard test set for English pars- 
ing), and WSJ-small, which is the first 400 
sentences of Section 23 and which is roughly 
comparable in size to the Chinese test set. 
Furthermore, when testing on WSJ-small, we 
trained on a subset of our English training 
data roughly equivalent in size to our Chinese 
training set (Sections 02 and 03 of the Penn 
Treebank); we have indicated models trained 
on all English training with "-all", and mod- 
els trained with the reduced English train- 
ing set with "-small". Therefore, by compar- 
ing the WSJ-small results with the Chinese 
results, one can reasonably gauge the perfor- 
mance gap between English parsing on the 
Penn Treebank and Chinese parsing on the 
Chinese Treebank. 
The reader will note that the modified BBN
model performs significantly worse than (Chiang,
2000) on Chinese. While more investigation is
required, we suspect part of the difference may
be due to the fact that currently, the BBN
model uses language-specific rules to guess
part of speech tags for unknown words.
4 Conclusions and Future Work 
There is no question that a great deal of care 
and expertise went into creating the Chinese 
Treebank, and that it is a source of important 
grammatical information that is unique to the
Chinese language. However, there are definite
similarities between the grammars of English
and Chinese, especially when viewed through
the lens of the statistical models we employed
here. In both languages, the nouns, adjectives,
adverbs, and verbs have preferences for
certain arguments and adjuncts, and these
preferences, in spite of the potentially vastly
different configurations of these items, are
effectively modeled. As discussed in the introduction,
lexical items' idiosyncratic parsing
preferences are modeled by lexicalizing the
grammar formalism, using a lexicalized PCFG
in one case and a lexicalized stochastic TAG
in the other. Linguistically-reasonable independence
assumptions are made, such as the
independence of grammar productions in the
case of the PCFG model, or the independence
of the composition operations in the case of
the LTAG model, and we would argue that
these assumptions are no less reasonable for
the Chinese grammar than they are for that
of English. While results for the two languages
are far from equal, we believe that further tuning
of the head rules and analysis of development
test set errors will yield significant performance
gains on Chinese to close the gap.
Finally, we fully expect that absolute perfor- 
mance will increase greatly as additional high- 
quality Chinese parse data becomes available. 
5 Acknowledgements 
This research was funded in part by NSF
grant SBR-89-20230-15. We would greatly
like to acknowledge the researchers at BBN
who allowed us to use their model: Ralph
Weischedel, Scott Miller, Lance Ramshaw,
Heidi Fox and Sean Boisen. We would also
like to thank Mike Collins and our advisors 
Aravind Joshi and Mitch Marcus. 

References 
Daniel M. Bikel, Richard Schwartz, Ralph 
Weischedel, and Scott Miller. 1997. Nymble: 
A high-performance learning name-finder. In 
Fifth Conference on Applied Natural Language 
Processing, pages 194-201, Washington, D.C.
Daniel M. Bikel. 2000. A statistical model for 
parsing and word-sense disambiguation. In
Joint SIGDAT Conference on Empirical Meth- 
ods in Natural Language Processing and Very 
Large Corpora, Hong Kong, October. 
Eugene Charniak. 2000. A maximum entropy- 
inspired parser. In Proceedings of the 1st Meet- 
ing of the North American Chapter of the As- 
sociation for Computational Linguistics, pages 
132-139, Seattle, Washington, April 29 to May 
4. 
David Chiang. 2000. Statistical parsing with an 
automatically-extracted tree adjoining gram- 
mar. In Proceedings of the 38th Annual Meeting 
of the Association for Computational Linguis- 
tics. 
Kenneth Church. 1988. A stochastic parts pro- 
gram and noun phrase parser for unrestricted 
text. In Second Conference on Applied Natural 
Language Processing, pages 136-143, Austin, 
Texas. 
Michael Collins. 1997. Three generative, lexi- 
calised models for statistical parsing. In Pro- 
ceedings of ACL-EACL '97, pages 16-23. 
Michael Collins. 2000. Discriminative reranking 
for natural language parsing. In International
Conference on Machine Learning. (to appear). 
Aravind K. Joshi and Yves Schabes. 1997. Tree-adjoining
grammars. In A. Salomaa and
G. Rozenberg, editors, Handbook of Formal
Languages and Automata, volume 3, pages 69- 
124. Springer-Verlag, Heidelberg. 
Mitchell P. Marcus, Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Building a 
large annotated corpus of English: The Penn 
Treebank. Computational Linguistics, 19:313- 
330. 
Scott Miller, Heidi Fox, Lance Ramshaw, and 
Ralph Weischedel. 1998. SIFT - Statistically- 
derived Information From Text. In Seventh 
Message Understanding Conference (MUC-7), 
Washington, D.C. 
Owen Rambow, K. Vijay-Shanker, and David
Weir. 1995. D-tree grammars. In Proceedings
of the 33rd Annual Meeting of the Association
for Computational Linguistics, pages 151-158.
Adwait Ratnaparkhi. 1997. A simple introduction
to maximum entropy models for natural
language processing. Technical Report IRCS
Report 97-08, Institute for Research in Cognitive
Science, May.
Philip Resnik. 1992. Probabilistic tree-adjoining 
grammar as a framework for statistical natural
language processing. In Proceedings of
COLING-92, pages 418-424. 
Yves Schabes and Stuart M. Shieber. 1994. An 
alternative conception of tree-adjoining deriva- 
tion. Computational Linguistics, 20(1):91-124. 
Yves Schabes. 1992. Stochastic lexicalized 
tree-adjoining grammars. In Proceedings of 
COLING-92, pages 426-432. 
Fei Xia. 1999. Extracting tree adjoining gram- 
mars from bracketed corpora. In Proceedings 
of the 5th Natural Language Processing Pacific 
Rim Symposium (NLPRS-99). 
