Parsing Chinese with an Almost-Context-Free Grammar 
Xuanyin Xia and Dekai Wu 
HKUST
Department of Computer Science 
University of Science and Technology 
Clear Water Bay, Hong Kong 
{samxia, dekai}@cs.ust.hk
Abstract 
We describe a novel parsing strategy we are 
employing for Chinese. We believe progress in 
Chinese parsing technology has been slowed 
by the excessive ambiguity that typically 
arises in pure context-free grammars. This 
problem has inspired a modified formalism 
that enhances our ability to write and main- 
tain robust large grammars, by constraining 
productions with left/right contexts and/or 
nonterminal functions. Parsing is somewhat 
more expensive than for pure context-free 
parsing, but is still efficient by both theoret- 
ical and empirical analyses. Encouraging ex- 
perimental results with our current grammar 
are described. 
Introduction 
Chinese NLP is still greatly impeded by the 
relative scarcity of resources that have al- 
ready become commonplace for English and 
other European languages. A strategy we 
are pursuing is the use of automatic meth- 
ods to aid in the acquisition of such resources 
(4, 5, 6). However, we are also selectively en- 
gineering certain resources by hand, for both 
comparison and applications purposes. One 
such tool that we have been developing is 
a general-purpose bracketer for unrestricted 
Chinese text. In this paper we describe an 
approach to parsing that has evolved as a re- 
sult of the problems we have encountered in 
making the transition from English to Chinese 
processing. 
We have found that the primary obstacle 
has been the syntactic flexibility of Chinese, 
coupled with an absence of explicit marking by 
morphological inflections. In particular, com- 
pounding is extremely flexible in Chinese, al- 
lowing both verb and noun constituents to be 
arbitrarily mixed. This creates extraordinary 
difficulties for grammar writers, since robust 
rules for such compound forms tend to also 
accept many undesirable forms. This creates 
too many possible parses per sentence. We 
employ probabilistic grammars, so that it is 
possible to choose the Viterbi (most probable) 
parse, but probabilities alone do not compen- 
sate sufficiently for the inadequacy of struc- 
tural constraints. 
There are two usual routes: either (1) 
keep the context-free basis but introduce finer- 
grained categories, or (2) move to context- 
sensitive grammars. The former strategy in- 
cludes feature-based grammars with weak uni- 
fication. One disadvantage of this approach is 
that some features can become obscure and 
cumbersome. Moreover, the expressive power 
remains restricted to that of a CFG, so certain 
constraints simply cannot be expressed. Thus, 
many systems opt for some variety of context- 
sensitive grammar. However, it is easy for 
parsing complexity in such systems to become 
impractical. 
We describe an approach that is not quite 
context-free, but still admits acceptably fast 
Earley-style parsing. A benefit of this ap- 
proach is that the form of rules is natural and 
simple to write. We have found this approach 
to be very effective for constraining the types 
of ambiguities that arise from the compound- 
ing flexibility in Chinese. 
In the remainder of this paper, we first de- 
scribe our grammar framework in Sections 2- 
4. The parsing strategy is then described in 
Section 5, followed by current experimental re- 
sults in Section 6. 
The Grammar Framework 
We have made two extensions to the form of 
standard context-free grammars: 
1. Right-hand-side contexts 
2. Nonterminal functions 
We would like to note at the outset that 
from the formal language standpoint, the com- 
plications introduced by the form of our pro- 
duction rules have so far hindered theoreti- 
cal analyses of the formal expressiveness char- 
acteristics of this grammar. Because of the 
nature of the constraints, it is unclear how 
the expressiveness relates to, for example, 
the more powerful unification-based gram- 
mars that are widespread for English. 
At the same time, however, we will show 
that the natural format of the rules has greatly 
facilitated the writing of robust large gram- 
mars. Also, an efficient Earley-style parser 
can be constructed as discussed below for 
grammars of this form. For our applica- 
tions, we therefore feel the effectiveness of the 
grammar form compensates for the theoretical 
complications. 
We now describe the extensions, but first de- 
fine some notation used throughout the paper. 
A traditional context-free grammar (CFG) is a
four-tuple $G = (N, \Sigma, P, S)$, where $N$ is a finite
set of nonterminal symbols, $\Sigma$ is a finite set of
terminal symbols such that $N \cap \Sigma = \emptyset$, $P$ is a
finite set of productions and $S \in N$ is a special
designated start symbol. Productions in $P$
are denoted by symbol $P_r$, $1 \le r \le |P|$, and
have the form $D_r \rightarrow Z_{r,1} Z_{r,2} \cdots Z_{r,\pi_r}$, $\pi_r \ge 0$,
where $D_r \in N$ and $Z_{r,j} \in N \cup \Sigma$, $1 \le j \le \pi_r$.
Right-hand-side contexts 
We introduce right-hand-side contexts to im- 
prove rule applicability decisions for complex 
compounding phenomena. The difficulty that 
ordinary CFGs have with complex compound- 
ing phenomena can be seen from the following 
example grammar fragment: 
1. RelPh → NP vn 的(de)
2. Nom → NP vn 的(de)
3. NP → Nom
4. NP → RelPh NP
5. NP → NP NP
Here, RelPh is a relative phrase, Nom is a
nominalization (similar to a gerund), vn is a
lexical verb category requiring an NP argument,
and 的(de) is a genitive particle.
The sequence 
(1) a. 警務處 提供 的 答覆
b. jǐngwùchù tígōng de dáfù
c. police provide -- answer
d. the answer provided by police
can be parsed either by
[[[警務處]NP [提供]vn 的]RelPh [答覆]NP]NP
or by
[[[[警務處]NP [提供]vn 的]Nom]NP [答覆]NP]NP
However the latter parse is not linguisti- 
cally meaningful, and is rather an artifact of 
the overly general noun compounding rule 5. 
The problem is that it becomes quite cum- 
bersome in a pure CFG to specify accurately 
which types of noun phrases are permitted to 
compound, and this usually leads to excessive 
proliferation of features and/or nonterminal 
categories. 
Instead, the approach described here aug- 
ments the CFG rules with a restricted set of 
contextual applicability conditions. A production
in our extended formalism may have
left and/or right context, and is denoted as
$P_r = \{L\}\, Z_{r,1} Z_{r,2} \cdots Z_{r,\pi_r}\, \{R\}$, where $L, R \in (N \cup \Sigma)^*$
and the left context condition $L$ and
the right context condition R are of a form 
described below. These context conditions 
help cut the parser's search space by elimi- 
nating many possible parse trees, increasing 
both parsing speed and accuracy. Though 
ambiguities remain, the smaller number of 
parses per sentence makes it more likely that 
most-probable parsing can pick out the cor- 
rect parse. 
Nonterminal functions 
In addition, a second extension is the introduction
of a variety of nonterminal functions
that may be attached to any nonterminal or
terminal symbol.¹ These functions are designed
to facilitate natural expression of conditions
for reducing ambiguities. Some of the
functions are simply notational sugar for standard
CFGs, while others are context-sensitive
extensions. These functions are listed in the following
sections. By convention, we will use a
and b for symbols that can be either terminals
or nonterminals, c for terminal symbols only,
d for the semantic domain of a terminal, and
i for an integer index.
¹ The term nonterminal functions was chosen
for mnemonic purposes; it is actually a misnomer
since they can be applied to terminal symbols as
well.
The not function 
The not function is denoted as /!b, which 
means any constituent not labeled b. Note 
that this feature must not be used with rules 
that can cause circular derivations of the type 
$A \Rightarrow^{*} A$, since this would lead to a logical
contradiction. 
In the previous example, if we change rule 
2 to 
Nom → NP vn 的 {/!NP}
the new right condition /!NP prevents rule 2
from being used within cases such as rule 5,
where the immediately following constituent
is an NP. This causes the correct parse to
be chosen:
[[[警務處]NP [提供]vn 的]RelPh [答覆]NP]NP
We have only found this function useful 
for left and right contexts, rather than the 
main body of production right-hand-sides. 
The excluded-category function 
The excluded-category function is denoted as
a/!b, which means a constituent labeled a
that cannot also be labeled as b. Again,
this function must not be used with rules that
can cause circular derivations.
The main purpose of the excluded- 
category function is to improve robustness 
when the grammar coverage inadequacies pre- 
vent a full parse tree from being found. In 
such cases, our parser will instead return a 
partial parse tree, as discussed further in Sec- 
tion 5. The excluded-category function can 
help improve the chances of choosing the cor- 
rect rules within the partial parse tree. 
For example, consider its use with the 
verb phrase construction 
把 NP verb (Obj)
which is known as the 把(ba)-construction. If
the verb has part of speech vn, then it is monotransitive
and only one object is needed to
form a VP, but if the verb is a ditransitive
vnn, then a second object is needed to form
the VP.
An example of the monotransitive case is²
(2) a. 把 飯 吃 了
b. bǎ fàn chī le
c. -- food eat --
d. have eaten the food
while an example of the ditransitive case is
(3) a. 把 飯 送 人 了
b. bǎ fàn sòng rén le
c. -- food give somebody --
d. give food to somebody
The former phrase can be correctly parsed by
the monotransitive rule
VP → 把 NP vn
Suppose that the parser is unable to find any 
full parse tree for some sentence that includes 
the latter phrase. The above monotransitive 
rule would still be considered by the parser, 
since it is performing partial parsing, and this 
rule matches the subsequence 把 飯 送. In fact
this is not the correct rule for the ditransitive
phrase (the VP is not 把 飯 送 but rather
把 飯 送 人), but we would not be able to distinguish
the monotransitive and ditransitive
cases 把 飯 吃 and 把 飯 送, because both 吃
and 送 can have part of speech vn. Thus the
monotransitive subparse might incorrectly be
chosen for the partial parse output (whether
this happens depends rather arbitrarily on the
possible subparses found over the rest of the
sentence).
The key to eliminating the incorrect possibility
altogether is that only 送 can also have
the part of speech vnn. We refine the rule with
our excluded-category function:
VP → 把 NP vn/!vnn
² For this and all subsequent examples, (a) is
the Chinese written form, (b) is its pronunciation,
(c) is its word gloss ('--' means there is no
directly corresponding word in English), and (d)
is its approximate English translation.
The monotransitive phrase can still be parsed
by this new rule since 吃 cannot have the part
of speech vnn:³
[把 [飯]NP [吃]vn]VP 了
But because 送 can be labeled as either vn or
vnn, it does not match vn/!vnn, and therefore
the rule cannot be applied to the ditransitive
phrase. This leaves the ditransitive production
VP → 把 NP vnn NP
as the only possibility, forcing the correct sub- 
parse to be chosen here. In a sense, this func- 
tion allows a measure of redundancy in the 
grammar specification and thereby improves 
robustness. 
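The exclusion test can be pictured with a minimal sketch, given here in Python. It assumes the chart is a plain list of labeled spans; the Entry type, the matches_excluding helper and the word positions are hypothetical names chosen for illustration and are not taken from the parser described in this paper. The same check covers the bare /!b form when no category a is required.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    """A chart entry: a constituent labeled `category` over words[start:end]."""
    start: int
    end: int
    category: str


def matches_excluding(chart: List[Entry], entry: Entry,
                      wanted: Optional[str], excluded: str) -> bool:
    """Check a/!b (wanted='a') or /!b (wanted=None) against one chart entry."""
    if wanted is not None and entry.category != wanted:
        return False
    # Reject if any chart entry over the same span carries the excluded label.
    return not any(
        e.start == entry.start and e.end == entry.end and e.category == excluded
        for e in chart
    )


# Example (2): the verb glossed "eat" (word position 2) is only a vn.
chart_mono = [Entry(2, 3, "vn")]
# Example (3): the verb glossed "give" (word position 2) is both vn and vnn.
chart_di = [Entry(2, 3, "vn"), Entry(2, 3, "vnn")]

mono_ok = matches_excluding(chart_mono, chart_mono[0], "vn", "vnn")
di_ok = matches_excluding(chart_di, chart_di[0], "vn", "vnn")
```

Under these assumptions the monotransitive verb satisfies vn/!vnn while the ambiguous verb does not, which is the behaviour the refined rule relies on.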
The substring-linking function 
The substring-linking function is denoted a/i. 
This is used to remember the string that was 
matched to a constituent a, so that the string 
can be compared to a subsequent appearance 
of a/i in the same production. In general, we 
may have several occurrences of the same non- 
terminal, and it is occasionally useful to be 
able to constrain those occurrences to match 
exactly the same string. 
One important use of substring-linking in 
Chinese is for reduplicative patterns. Another 
use can be seen in the following two sentences: 
(4) a. 他 做 不 做 這 件 事
b. tā zuò bù zuò zhè jiàn shì
c. he do not do this -- thing
d. will he do this thing
(5) a. 他 做 不 到 這 件 事
b. tā zuò bù dào zhè jiàn shì
c. he do not do this -- thing
d. he cannot do this thing
Let us consider two sequences 做 不 做 and 做
不 到 in (4) and (5) respectively, where 做 and
到 can both be labeled as vn, but they have
different roles. The former indicates a question,
and the latter a negative declaration; clearly
the parses must differentiate these two cases.
If the only rule in the grammar to handle 
these examples is 
question_verb → vn 不 vn
then the two sequences will be parsed identically.
However, with the substring-linking
function we can refine the rule to
question_verb → vn/1 不 vn/1
Now the first vn/1 is defined as 做 in both
cases when the first 做 is parsed. For the
first sequence, the second 做 matches the second
vn/1 when it is compared to the earlier-defined
value of vn/1. Because the substrings
match, the first sequence can be parsed by this
rule as
[[做]vn 不 [做]vn]question_verb
In contrast, for the second sequence, when 到
is compared with the defined value of vn/1
(做), they are different, and therefore the
second sequence cannot be parsed by the rule.
³ The 了 character is an aspect particle.
In this example, the defined value of a
nonterminal is only one word. However, in
the general case it can be an arbitrarily long
string of words spanned by a nonterminal (vn/1
in this example).
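The linking behaviour can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every word is taken to be a vn, words are written in toneless pinyin, and the names match_linked and matches_question_verb are hypothetical.

```python
from typing import Dict, List, Tuple

Bindings = Dict[Tuple[str, int], Tuple[str, ...]]


def match_linked(category: str, index: int, entry_category: str,
                 entry_words: List[str], bindings: Bindings) -> bool:
    """Match a/i: the first occurrence binds the spanned words,
    later occurrences must span exactly the same words."""
    if entry_category != category:
        return False
    key = (category, index)
    if key not in bindings:
        bindings[key] = tuple(entry_words)
        return True
    return bindings[key] == tuple(entry_words)


def matches_question_verb(words: List[str]) -> bool:
    """Rule sketch: question_verb -> vn/1 NEG vn/1, with NEG written 'bu'.
    Assumption: both outer words are already labeled vn."""
    if len(words) != 3:
        return False
    first, neg, second = words
    bindings: Bindings = {}
    return (match_linked("vn", 1, "vn", [first], bindings)
            and neg == "bu"
            and match_linked("vn", 1, "vn", [second], bindings))


question = matches_question_verb(["zuo", "bu", "zuo"])   # sequence from (4)
negation = matches_question_verb(["zuo", "bu", "dao"])   # sequence from (5)
```

Because the binding stores a tuple of words, the same comparison works when the linked constituent spans more than one word.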
The semantic-domain function 
The semantic-domain function is denoted by
c/&d and designates a terminal c whose semantic
domain is restricted to d. This is an
ordinary feature that we use in conjunction
with the BDC dictionary, which defines semantic
domains.
Given two sentences, 
(6) a. 在 廣東省 的 投資
b. zài guǎngdōngshěng de tóuzī
c. in Guangdong -- investment
d. the investment in Guangdong Province
(7) a. 在 小張 的 家
b. zài xiǎozhāng de jiā
c. in XiaoZhang -- house
d. in XiaoZhang's house
they have the same surface structure
在 NP 的 NP
but they are quite different. In (6), 在 廣東省
is the modifier of 投資. In (7), 小張 is a
modifier of 家, and they together form an NP
as the object of 在.
It is very hard to distinguish these two
cases in general. With traditional CFGs, this
is problematic because both 廣東省 and 小張
have the part of speech np, and both 投資 and 家
have part of speech nc. We can do a somewhat
better job by using the domain knowledge
supplied by a dictionary with semantic
classes.
The difference between the two phrases is
that although 廣東省 and 家 are both location
nouns, not all NPs following a 在 can be
formed into a locative phrase: only if the head
noun of the NP is a location noun can it
be parsed as a locative phrase. (6) is parsed
as
[[[在 [廣東省]NP]LocPh 的]ModPh [投資]NP]NP
because 在 廣東省 is a locative phrase, where
LocPh stands for locative phrase, and ModPh
stands for modifier phrase. But in (7), the
entire phrase 在 小張 的 家 forms a locative
phrase, and is parsed as
[在 [[[小張]NP 的]ModPh [家]NP]NP]LocPh
The key point here is how to define a location
noun. We have rules
location_noun → np/&GE
and
location_noun → nc/&GE
where GE is the abbreviation of geology. Because
the domain of 廣東省 is GE, it is parsed
as a location_noun, and together with the
leader 在 is parsed as a locative phrase. But
小張 cannot be parsed as a locative phrase
with the leader 在 since its domain is not
GE; instead it is parsed as the modifier of
家, at which point the parser will further check
whether 在 plus 小張 的 家 can be parsed as a
locative phrase.
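A semantic-domain test amounts to a dictionary lookup, as the following Python sketch suggests. The toy LEXICON stands in for the BDC dictionary; its layout, the toneless pinyin keys and the domain label "HUMAN" are assumptions made for this example only. The GE domain for Guangdong Province and the non-GE domain for XiaoZhang follow the discussion above.

```python
from typing import Dict, Tuple

# Hypothetical toy lexicon: word -> (part of speech, semantic domain).
LEXICON: Dict[str, Tuple[str, str]] = {
    "guangdongsheng": ("np", "GE"),
    "xiaozhang": ("np", "HUMAN"),  # "HUMAN" is an invented domain label
}


def matches_domain(word: str, pos: str, domain: str) -> bool:
    """Check the pattern pos/&domain against a single word."""
    return LEXICON.get(word) == (pos, domain)


def is_location_noun(word: str) -> bool:
    """location_noun -> np/&GE  and  location_noun -> nc/&GE."""
    return matches_domain(word, "np", "GE") or matches_domain(word, "nc", "GE")


guangdong_is_location = is_location_noun("guangdongsheng")
xiaozhang_is_location = is_location_noun("xiaozhang")
```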
The has-subconstituent function 
This function is denoted as a/@b, which 
means a constituent labeled a with any de- 
scendant of category b, where a is a nontermi- 
nal and b can be either a terminal or a nonter- 
minal. In other words, this matches an inter- 
nal node labeled a, which has a subtree with 
root labeled b. 
Consider the two sentences 
(8) a. 他 學 了 兩 個 星期
b. tā xué le liǎng gè xīngqī
c. he learn -- two -- week
d. he has learned it for two weeks
(9) a. 他 學 了 兩 篇 課文
b. tā xué le liǎng piān kèwén
c. he learn -- two -- lesson
d. he has learned two lessons
In Sentence (8), 兩 個 星期 is the complement
of 學, while in Sentence (9), 兩 篇 課文
is the object of 學. However, both NPs
兩 個 星期 and 兩 篇 課文 superficially have
the same structure, and the parser may assign
Sentence (8) the wrong parse tree
[[他]NP [[學]vn 了 [[兩 個]ClPh [星期]NP]NP]VP]clause
instead of the correct one
[[他]NP [[學]vn 了 [[[[兩 個]ClPh [[星期]time_particle]NP]NP]TP]Comp]VP]clause
where ClPh stands for classifier phrase, TP 
stands for time phrase, and Comp stands for 
the complement of a verb. 
The difference between them is that
星期 is a time particle, and therefore is parsed
with its classifier 兩 個 as a time phrase,
whereas 課文 is a general noun, and is parsed
with its classifier 兩 篇 as a general NP.
With the rule
time_phrase → NP/@time_particle
we can parse 兩 個 星期 as a time phrase, and
since it is a time phrase, it will be parsed as
the complement of 學. But because 兩 篇 課文
is just a general NP, it cannot be parsed with
this rule, and it will serve only as the object
of 學.
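The descendant test behind a/@b is a simple tree search. The sketch below is a Python illustration only; the Node class and the helper names are hypothetical, and the two trees are simplified versions of the NPs in (8) and (9).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A parse-tree node; leaves simply have no children."""
    category: str
    children: List["Node"] = field(default_factory=list)


def has_descendant(node: Node, category: str) -> bool:
    """True if some proper descendant of `node` is labeled `category`."""
    return any(
        child.category == category or has_descendant(child, category)
        for child in node.children
    )


def matches_has_subconstituent(node: Node, a: str, b: str) -> bool:
    """Check the pattern a/@b against a subtree."""
    return node.category == a and has_descendant(node, b)


# "two weeks": the head noun is a time particle.
weeks = Node("NP", [Node("ClPh"), Node("NP", [Node("time_particle")])])
# "two lessons": the head noun is a common noun (nc).
lessons = Node("NP", [Node("ClPh"), Node("NP", [Node("nc")])])

weeks_is_time_phrase = matches_has_subconstituent(weeks, "NP", "time_particle")
lessons_is_time_phrase = matches_has_subconstituent(lessons, "NP", "time_particle")
```

The cost of this check grows with the size of the subtree, which is one reason matching is no longer a constant-time operation.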
Earley Parsing 
We use a generalization of the Earley algo- 
rithm (3, 2) to parse grammars of our form. 
Although the time complexity rises compared 
to the Earley algorithm, it remains polynomial 
in the worst case. 
Algorithm 
The key to modifying the Earley algorithm to 
handle the left and right context conditions is 
that our rules can be rewritten into a full form 
which includes all symbols including the con- 
texts, plus indices indicating the left and/or 
right context boundaries. For example, let 
A → {L} B {R} and C → D E {R} be two
production rules. They are rewritten respectively
as A → L B R, start = 2, len = 1
and C → D E R, start = 1, len = 2. Once
this transformation has been made, the ma- 
chinery from the Earley algorithm carries over 
remarkably smoothly. 
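The rewriting step can be illustrated with a short Python sketch. The FullRule record and the to_full_form helper are hypothetical names; only the start and len bookkeeping (called length here) follows the description above, with start counted from 1.

```python
from typing import NamedTuple, Sequence, Tuple


class FullRule(NamedTuple):
    lhs: str
    symbols: Tuple[str, ...]  # left context + rule body + right context
    start: int                # 1-based position of the first body symbol
    length: int               # number of body symbols


def to_full_form(lhs: str, body: Sequence[str],
                 left: Sequence[str] = (),
                 right: Sequence[str] = ()) -> FullRule:
    """Flatten {left} body {right} into one symbol sequence plus indices."""
    symbols = tuple(left) + tuple(body) + tuple(right)
    return FullRule(lhs, symbols, len(left) + 1, len(body))


rule_a = to_full_form("A", ["B"], left=["L"], right=["R"])
rule_c = to_full_form("C", ["D", "E"], right=["R"])
```

With the contexts flattened in this way, a chart edge can advance over context symbols exactly as it does over body symbols, while start and length record which part of the matched span belongs to the new constituent.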
The main loop of the parsing algorithm 
employs the following schema. 
1. Pop the first entry from the agenda; call the 
popped entry c. 
2. If c is already in chart, go to 1. 
3. Add c to chart. 
4. For all rules whose left corner is b, call
match(b, c). If the return value is 1, add
an initial edge e for that rule to chart; for
all the chart entries (subtrees) c' beginning
at end(e)+1, if g is the active symbol in the
RHS (right-hand-side) of e and match(g, c')
returns 1, then call extend(e, c').
5. If the edge e is finished, add an entry to the 
agenda. 
6. For all edges d, if g is the active symbol in 
the RHS of d and match(g, c) returns 1, 
then call extend(d, c) and add the resulting 
edge. 
7. Go to 1. 
extend(e, c): (extends an edge e with the chart
entry (subtree) c)
1. Create a new edge e'.
2. Set start(e') to start(e).
3. Set end(e') to end(c).
4. Set rule(e') to rule(e) with the dot moved beyond
c.
5. If the edge e' is finished (i.e., a subtree) then
add e' to the agenda, else for all chart subtrees
c' beginning at end(e')+1, if g is the
active symbol in the RHS of e' and match(g,
c') returns 1, call extend(e', c').
match(g,c): (checks whether a subtree c can 
be matched by a symbol g) 
1. If c's category does not equal g's category,
return 0.
2. Check whether g's associated functions are 
satisfied by c -- 
(a) If g has the form a/!b or /!b, check all 
the entries in the chart that span the 
same range as c, returning 0 if any have 
category b. 
(b) If g has the form a/i, if a/i is not defined,
link it to c and return 1. Otherwise, compare
c with the defined value of a/i; if not
the same, return 0.
(c) If g has the form c/&d, if the semantic 
domain of c is not d, return 0. 
(d) If g has the form a/@b, check all the 
nodes of the subtree c; if no node of cat- 
egory b is found, return 0. 
3. Return 1. 
The difference from standard Earley parsing
(aside from the rule transformation mentioned
above) lies in match. To check
whether an entry matches the left corner of a 
rule or whether an edge can be extended by an 
entry, we need to check not only that the cat- 
egory of the constituent is matched, but also 
that the attached function if any is satisfied. 
Recall that our application for the pars- 
ing algorithm is as the first stage of a ro- 
bust bracketer. We therefore use an extension 
of this parsing approach that permits partial 
parsing. In this version, if the sentence cannot 
be parsed, a minimum-size subset of subtrees 
that cover the entire sentence is produced. 
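One way to pick a minimum-size set of covering subtrees is dynamic programming over word positions, sketched below in Python. This is an assumption about how such a cover could be computed, not a description of the bracketer itself: spans are half-open (start, end, label) triples, the chosen subtrees are required to tile the sentence without overlap, and ties are broken by the order in which spans are listed.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, label), covering words[start:end]


def minimum_cover(n: int, spans: List[Span]) -> Optional[List[Span]]:
    """Return the fewest spans that tile words[0:n], or None if none exists."""
    # best[i] holds the smallest tiling found for words[0:i].
    best: List[Optional[List[Span]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for span in spans:
            s, e, _ = span
            if e == end and s < e and best[s] is not None:
                candidate = best[s] + [span]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical chart contents for a four-word input.
chart_spans = [
    (0, 1, "pron"), (0, 1, "NP"), (1, 2, "vn"),
    (2, 3, "de"), (3, 4, "NP"), (0, 3, "RelPh"),
]
cover = minimum_cover(4, chart_spans)
```

Here the three-word RelPh plus the final NP form a two-subtree cover, which is preferred over any cover built from single words.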
In the following, we will use an example 
sentence to demonstrate how the algorithm 
works. The sentence and the grammar we use 
here are oversimplified, but show how a right 
context is handled. 
The sentence to be parsed is 
(10) a. 他 買 的 衣服
b. tā mǎi de yīfu
c. he buy -- clothes
d. the clothes bought by him
and the grammar is
1. NP → pron
2. NP → nc
3. RelPh → NP vn 的 {NP}
4. NP → RelPh NP
5. pron → 他
6. nc → 衣服
7. vn → 買
The first portion of the parsing for this ex- 
ample is identical to standard Earley parsing. 
We pop the first entry from the agenda, 他,
and since it is not already there we add it to
the chart. The only initial edge to be added
is
pron → 他 •
Since this edge is finished, we add it to the
agenda.
Next we pop pron from the agenda, create an
initial edge
NP → pron •
and find it is also finished, and so add the NP
to the agenda.
Again we pop NP from the agenda, and
create the initial edge
RelPh → NP • vn 的 {NP}
We find this edge cannot be extended by any
entry and is not finished, so we go to step 1
and pop the next entry 買 from the agenda.
We continue this step until we pop 衣服
from the agenda, and add nc and later NP to
the agenda. Up to this point, all we are doing
is standard Earley parsing.
Now we pop NP which spans 衣服 from
the agenda, and find that the edge
RelPh → NP vn 的 • {NP}
can be extended by this entry. We find the
extended edge is finished, so we add the RelPh
to the agenda, then pop it, creating a new edge
NP → RelPh • NP
An entry (subtree) NP which spans 衣服
is already in the chart when the last edge is
created. Thus the last edge can be extended,
creating a finished edge, so we have created
a subtree NP that spans the whole sentence.
Since there is now a nonterminal that spans
the whole sentence, we can write down a parse
tree of the sentence in a subscripted bracket
form as
[[[[他]pron]NP [買]vn 的]RelPh [[衣服]nc]NP]NP
We do not yet have a tight upper-bound 
for this parsing algorithm in the worst case. 
Clearly the algorithm will be more time-consuming
than for CFGs because the match procedure
will need to check not only the categories
of the constituents, but also their associated
functions, and this check will not take
constant time as for CFGs.
But though the algorithm is clearly worse
than CFG parsing in the worst case, in practice the
complexity will depend heavily on the
particular sentences and the grammar. The
number and type of context conditions used 
in the grammar, and the kind of nonterminal 
functions, will greatly affect the efficiency of 
parsing. Thus empirical performance is the 
true judge, and our experience as described 
next has been quite encouraging. 
Results 
We are currently developing a robust gram- 
mar of this form for the Chinese bracketing 
application. Although the number of rules is 
changing daily, the evaluation was performed 
on a version of the grammar containing 948 
rules. The lexicon used was the BDC dictio- 
nary containing approximately 100,000 entries 
with 33 part of speech categories (1). 
To measure our progress, we have evaluated
precision on a previously unseen sample
of 250 sentences drawn from our corpus, which 
contains Hong Kong legislative proceedings. 
The sentences were randomly selected in var- 
ious length ranges of 4-10, 11-20, 21-30, 31- 
40, and 41-50 words, such that each of the 
five ranges contained 50 sentences. All those 
sentences were segmented by hand, though we 
will use an automatic segmenter in the future. 
We evaluated three factors: 
1. The percentage of labeled words. A word is
unlabeled if it cannot form deeper structure
with at least one other word. Unlabeled
words often indicate inadequacies with lexicon
coverage rather than the grammar.
2. Weighted constituent precision, i.e., the percentage
of correctly identified syntactic
constituents. A constituent is judged to be
correct only if both its bracketing and its
syntactic label are correct.
Because at the current stage we do not select a
single parse tree when a sentence has several,
we uniformly weight the precision over all
the parse trees for the sentence. Therefore
this measure is a kind of weighted precision
(6).
Figure 1: Examples of parse output (see text).
Figure 2: Examples of parse output (cont'd).
length of sentence              4-10   11-20  21-30  31-40  41-50
% words labeled                 83.10  99.61  95.67  94.82  95.45
% correct constituents          85.41  83.57  81.23  80.20  78.85
run time per sentence (secs.)   2.03   3.54   9.00   5.08   37.50
Table 1: Evaluation results. 
In the future, we will output the single most
probable parse tree for each sentence that
can be parsed. Note that the precision in
that case is likely to be bounded below by
the weighted precision reported here, since
we currently assign equal weight to all
parses, even improbable ones.
3. The average run time per sentence. 
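The equal-weight scheme mentioned under item 2 can be illustrated with a minimal sketch. It is not the evaluation code used for Table 1; it assumes that each parse is represented as a set of labeled spans and that a hand-bracketed reference is available. The names `precision`, `weighted_precision`, `parses`, and `gold` are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a constituent is (label, start, end),
# and a parse is a set of such constituents.
Constituent = Tuple[str, int, int]
Parse = FrozenSet[Constituent]


def precision(parse: Parse, gold: Parse) -> float:
    """Fraction of constituents in `parse` that also appear in `gold`."""
    if not parse:
        return 0.0
    return len(parse & gold) / len(parse)


def weighted_precision(parses: List[Parse], gold: Parse) -> float:
    """Average precision over all parses, each given equal weight 1/n."""
    if not parses:
        return 0.0
    return sum(precision(p, gold) for p in parses) / len(parses)


# Toy reference bracketing for a three-word sentence (hypothetical data).
gold = frozenset({("nounph", 0, 1), ("verbph", 1, 3), ("clause", 0, 3)})
parses = [
    # First alternative: all three constituents match the reference.
    frozenset({("nounph", 0, 1), ("verbph", 1, 3), ("clause", 0, 3)}),
    # Second alternative: only the top-level clause matches.
    frozenset({("nounph", 0, 2), ("verbph", 2, 3), ("clause", 0, 3)}),
]

score = weighted_precision(parses, gold)
best = max(precision(p, gold) for p in parses)
```

In this toy case the equal-weight average is pulled down by the poorer alternative. An average can never exceed the precision of the best parse, so a ranking model that tends to select better-than-average parses would be expected to score at least as well as the weighted figure.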
Results are shown in Table 1. Unfortunately,
we have found it impossible to perform
comparative evaluations against other
systems, because Chinese parsers are
generally unavailable. However, we believe
these performance levels to be quite
competitive and promising.
Meaningful baseline evaluations are currently
difficult to design for Chinese parsing
because no comparison standards are
available. Examples of the Chinese output
still give by far the most important
indication of parsing quality. Some
representative examples are shown in Figure 2
and its continuation. The parser produces two
kinds of output. If no complete parse tree is
found for the input sentence, a partial parse
is returned; such examples are shown without
a number preceding the parse. Otherwise, the
first complete parse tree is shown, preceded
by the number 0 (indicating that it was the
first alternative produced).
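The output convention just described can be sketched as follows. This is only an illustration under stated assumptions, not the parser's actual code: the function names `format_output` and `is_complete`, and the representation of parses as bracketed strings, are hypothetical.

```python
from typing import List, Optional


def format_output(complete_parses: List[str],
                  partial_parse: Optional[str]) -> str:
    """Return the line shown for one sentence.

    Assumed convention: the first complete parse is prefixed by its
    index 0; a partial parse is shown with no numeric prefix.
    """
    if complete_parses:
        return f"0 {complete_parses[0]}"
    return partial_parse or ""


def is_complete(line: str) -> bool:
    """A line that starts with a number denotes a complete parse."""
    head = line.split(maxsplit=1)
    return bool(head) and head[0].isdigit()


# Hypothetical bracketings with the word tokens omitted.
full = format_output(["(clause (nounph) (verbph))"], None)
partial = format_output([], "(nounph) (verbph)")
```

A reader of the figure can apply the same test as `is_complete`: a leading number marks a sentence for which a complete parse tree was found.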
Conclusion 
We have described an extension to context- 
free grammars that admits a practical pars- 
ing algorithm. We have found the notation 
and the increased expressiveness to be well- 
suited for writing large robust grammars for 
Chinese, particularly for handling compound- 
ing phenomena without incurring the level of 
parsing ambiguity common to pure context- 
free grammars. Experiments show promising 
performance on Chinese sentences. 
With regard to the theme of this confer- 
ence, we are clearly emphasizing representa- 
tion over algorithms. We have developed a 
new representation that neatly captures the 
domain characteristics, and in our experience, 
greatly improves the coverage and accuracy 
of our bracketer. Algorithms follow naturally 
as a consequence of the representational fea- 
tures. It will be interesting to explore the re- 
lationships between our grammar and other 
context-sensitive grammar formalisms, a topic 
we are currently pursuing. 

References 
[1] BDC. The BDC Chinese-English Electronic
Dictionary (version 2.0). Behavior Design
Corporation, 1992.
[2] Eugene Charniak. Statistical Language
Learning. MIT Press, Cambridge, MA, 1993.
[3] Jay Earley. An efficient context-free
parsing algorithm. Communications of the
Association for Computing Machinery,
13(2):94-102, 1970.
[4] Dekai Wu. An algorithm for simultaneously
bracketing parallel texts by aligning words.
In Proceedings of the 33rd Annual Conference
of the Association for Computational
Linguistics, pages 244-251, Cambridge,
Massachusetts, June 1995.
[5] Dekai Wu. Trainable coarse bilingual
grammars for parallel text bracketing. In
Proceedings of the Third Annual Workshop on
Very Large Corpora, pages 69-81, Cambridge,
Massachusetts, June 1995.
[6] Dekai Wu and Xuanyin Xia. Large-scale
automatic extraction of an English-Chinese
lexicon. Machine Translation,
9(3-4):285-313, 1995.
