A Block-Based Robust Dependency Parser for Unrestricted 
Chinese Text
Ming Zhou 
Microsoft Research China, 
Sigma Centre, 49#, Zhichun Road, 
100080, Beijing, China 
mingzhou@microsoft.com 
Abstract 
Although substantial efforts have been made
to parse Chinese, very few parsers have been
put to practical use because they cannot
handle unrestricted text. This paper
presents a practical system for Chinese
parsing that uses a hybrid model of phrase
structure partial parsing and dependency
parsing. The system shows good
performance and high robustness in parsing
unrestricted text and has been applied in a
successful machine translation product.
Introduction 
Substantial efforts have been made to parse
Western languages such as English, and many
powerful computational models have been
proposed (Gazdar et al., 1987; Tomita, 1986).
However, very limited work has been done on
Chinese. This is mainly because the
structure of the Chinese language is quite
different from that of English. Therefore, the
computational models used for English may
not be directly applicable to Chinese.
Lin-Shan Lee et al. (1991) proposed a Chinese
natural language processing system with special
consideration of some typical phenomena of
Chinese. Jinye Zhou and Shi-kuo Chang (1986)
presented a deterministic Chinese parsing
methodology that uses formal semantics to
combine syntactic and semantic analysis.
However, most of the proposed approaches were
realized with a small-scale lexicon and rule base
(usually thousands of words and tens or hundreds
of rules). It is still an open question whether
these models will work on real texts containing
various ungrammatical phenomena. A parser
capable of handling real text should have not
only a large lexicon and a large rule base, but
also high robustness in coping with different
kinds of ungrammatical phenomena. Therefore, it
is important to design a grammar scheme that is
capable not only of representing the grammatical
structures unique to Chinese, which differ from
those of English, but also of handling
unrestricted text.
A phrase structure scheme is usually used in
English parsing models to represent sentence
structures, but in some cases phrase structure
is neither convenient nor expressive enough to
represent Chinese sentences. For example:
Sentence-1 ~t~ff\]i~~. 
Fig. 1 Phrase structure
Fig. 2 Dependency structure
1 This work was mainly done while the author visited Kodensha Ltd., Japan, during 1996-1999.
Sentence 1 is a pivot sentence (~gd'f~3), i.e.,
"~1~" is not only the object of "i,W" but also
the subject of " Ik~". The phrase structure
in Fig. 1 cannot indicate these relations clearly,
whereas the grammatical structure becomes
clear when it is represented as a dependency
structure (Fig. 2). Therefore, it is believed that
a dependency grammar scheme is more suitable
than phrase structure for representing Chinese
structures (Zhou and Huang, 1994). However,
traditional dependency grammar establishes
dependency relations between pairs of
specific words, so a large amount of word-based
dependency knowledge has to be constructed,
which is a time-consuming task. Fortunately,
knowledge for phrase structure parsing of
Chinese has been accumulated for many years, and
it should be reused to compensate for the lack of
knowledge for word-based dependency parsing.
Therefore, to combine the advantages of phrase
structure parsing and dependency parsing, we
propose a new parsing strategy called
"block-based dependency parsing".
A "block" is a basic component of a
sentence. For example, there are six blocks in
Sentence 1:
\] 
Another example: 
Sentence 2: ~:~_12q::~IJ~Af\]i~ ~ ~ ~n~ 
Blocks:\[If~.J2q:\] \[~t~'ff\]\] \[~\] \[~l~:~:~t~:\] 
\] 
A block represents an information unit in
communication. For example, in
Chinese-Japanese machine translation, the
translations of the members of a block in a
Chinese sentence usually fall within the same
block in the Japanese translation. Furthermore, a
block is easy to represent with phrase structure,
whereas representing it with dependency
structure is rather complicated.
The block-based dependency parsing process
works as follows. For an input sentence, the basic
components of the sentence, i.e., the "blocks", are
first identified by an ATN-like partial parsing
procedure, which produces a clear skeleton of
the sentence structure. In our phrase structure
analysis, we do not try to reduce the whole
sentence to the root S; instead, we only try to
obtain its components, namely the blocks. This
partial parsing strategy guarantees high robustness.
Dependency parsing is then applied to
build dependency relations among the blocks. The
dependency parser skips any ungrammatical
portions it encounters. This strategy confines
the ungrammatical portions and prevents errors
from propagating globally. With partial parsing and
the skip strategy, the parser can handle long,
complicated, or even faulty sentences. The
experiments in Section 4 show that the parser is
robust on unrestricted text. A parser based on this
approach has been developed, with a lexicon of
220,000 words, 5,000 part-of-speech tagging rules,
over 1,000 block parsing rules and 300 dependency
parsing rules. This parser has been applied in a
Chinese-Japanese machine translation product
(Zhou, 1999). To the author's knowledge, it is
one of the largest-scale Chinese parsers
ever implemented.
The outline of this paper is as follows. In
Section 1, we present our solution to
part-of-speech tagging, which significantly
affects Chinese parsing. Section 2 describes
the block-based dependency parsing approach
in detail. We then explain the dependency
parsing algorithm in Section 3. The experiments
and their analysis are given in Section 4. The
conclusion is given in the final section.
1 Rule-based part-of-speech tagging 
The Chinese language has many special
syntactic phenomena that are substantially
different from English (Chao, 1981; Huang, 1982;
Wu and Hou, 1982). One of the biggest problems is
that a verb undergoes no morphological change,
whether it functions as the predicate, subject,
object, or modifier of a noun. For instance:
vhv r,rl 
(\[i igCN  V\] NP) 
The Chinese linguistics literature insists that such
words are verbs and should be marked as "V"
regardless of their context. Under this view,
there will be phrase structure rules for
noun phrases such as:
NP->N+V 
NP->V+N 
However, there must also be rules for VP and S
such as:
S->N+V 
VP->V+N 
Therefore, the conflict between rules becomes very
serious. This means that part-of-speech information
in Chinese is too weak to support Chinese
syntactic analysis. To solve this problem, we
propose that, in the part-of-speech tagging stage,
the actual grammatical function of such a word
be determined directly, e.g., as N instead of V.
To do this, we list all possible word categories
of a word in the lexicon, for
example:
~ V/N/F/A 
//V: verb; N: noun; F: adverb; A: adjective 
A set of rules with comprehensive context 
constraints is designed to determine the specific 
part-of-speech of a word in a context. For 
example: 
1. X(NIR~J3(VINIFIA) + X(V*)->~(F) + X(V) 
2. ~+~(VINIFIA) + X(~)->~+~flq) + X(N) 
3. ~+~(VINIFIA ) + X(VFN)->~+~J(A ) + X(N) 
4. iE~+~J3(VINIFIA) + X(~V)->iE:~+~J3(V) 
X(N|R): a word X whose word categories may
include N, R, or others;
R: pronoun;
*: any part of speech;
V*: having the V category;
-V: having no V category.
Ideally, we would have a large corpus
tagged with this kind of word category
information, from which tagging rules
or an n-gram model could be obtained by training.
However, at present we cannot find a Chinese
corpus tagged with this kind of part-of-speech
information to use as training data, so we had to
write the part-of-speech disambiguation rules
manually. Currently, over 5,000 linguistic rules
have been designed.
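The context-constrained rules described above can be illustrated with a minimal sketch. It is not the system's rule engine: the rule format, the two toy rules, and the placeholder words (`w_pron`, `w_amb`, `w_noun`, `w_verb`) are all assumptions made for illustration. Only the idea of checking the category sets of the neighbouring words (as in "V*" and "-V") comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional

Cats = FrozenSet[str]


def has(cat: str) -> Callable[[Cats], bool]:
    """Constraint in the spirit of 'V*': the neighbour may have category cat."""
    return lambda cats: cat in cats


def anything(cats: Cats) -> bool:
    """Constraint in the spirit of '*': any neighbour (or none) is accepted."""
    return True


@dataclass(frozen=True)
class Rule:
    left: Callable[[Cats], bool]   # constraint on the previous word
    right: Callable[[Cats], bool]  # constraint on the next word
    tag: str                       # category chosen when both constraints hold


# Hypothetical rules, not taken from the paper's 5,000-rule set.
RULES: List[Rule] = [
    Rule(left=has("R"), right=has("N"), tag="V"),  # pronoun + X + noun-like word
    Rule(left=anything, right=has("V"), tag="N"),  # X + verb-capable word
]

# Hypothetical lexicon listing every possible category of each word.
LEXICON: Dict[str, Cats] = {
    "w_pron": frozenset({"R"}),
    "w_amb": frozenset({"V", "N"}),
    "w_noun": frozenset({"N"}),
    "w_verb": frozenset({"V"}),
}


def disambiguate(words: List[str]) -> List[Optional[str]]:
    """Pick one category per word; None means no rule fired."""
    cats = [LEXICON[w] for w in words]
    boundary: Cats = frozenset()  # stands for "no neighbour" at sentence edges
    tags: List[Optional[str]] = []
    for i, own in enumerate(cats):
        if len(own) == 1:  # unambiguous word: nothing to decide
            tags.append(next(iter(own)))
            continue
        prev_c = cats[i - 1] if i > 0 else boundary
        next_c = cats[i + 1] if i + 1 < len(cats) else boundary
        chosen = None
        for rule in RULES:  # first matching rule wins
            if rule.tag in own and rule.left(prev_c) and rule.right(next_c):
                chosen = rule.tag
                break
        tags.append(chosen)
    return tags
```

In this sketch the same ambiguous word receives V after a pronoun and before a noun, but N when it precedes a verb, which is the kind of direct decision the tagging stage is meant to make.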
2 Block-based Chinese dependency analysis 
As indicated in Fig. 3, block-based dependency
analysis consists of four modules: word
segmentation, part-of-speech tagging, block
analysis and dependency analysis. A
bi-directional heuristic longest matching method
is applied to decide the optimal word sequence.
A set of manually compiled linguistic rules is
applied to decide the optimal word category
sequence. In the partial parsing process, local
structures (such as duplication, prefixes and
suffixes) are first identified by a set of word
formation rules, and proper names are identified
by a set of construction rules. Local structures of
this kind are called meta-blocks. Next, frame
structures (DP), which have a paired starting word
and ending word, such as "~'"'. "I~", "~'"'.
"qa", etc., are identified, but the analysis of
their internal structure is delayed. Then the ATN
network is used to identify the basic blocks, called
level-1 blocks (these blocks do not contain IP, LP
or DP). After that, a set of heuristic rules is used
to identify the boundaries of IP and LP. The ATN
network is then used again to identify the
complicated blocks, called level-2 blocks, which
may contain LP, DP and IP as components. The
resulting sequence of blocks is passed
to the dependency parser, which generates
dependency relations among the blocks. Finally,
we recursively parse the internal parts of IP,
LP and DP to obtain their inner blocks and
dependency relations.
Fig. 3 Configuration of the block-based dependency parser
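The bi-directional longest matching step can be sketched as follows. The text does not spell out the heuristic used to choose between the two scans, so the tie-breaking rule here (fewer words, then fewer single-character words, then the backward result) is an assumption. The toy lexicon and the example string are also illustrative and do not come from the system's 220,000-word lexicon.

```python
from typing import List, Set, Tuple


def forward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Scan left to right, always taking the longest lexicon word."""
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Scan right to left, always taking the longest lexicon word."""
    words: List[str] = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Run both scans and keep the cheaper one (assumed heuristic)."""
    max_len = max((len(w) for w in lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if fwd == bwd:
        return fwd

    def cost(seg: List[str]) -> Tuple[int, int]:
        # Fewer words first, then fewer single-character words.
        return (len(seg), sum(1 for w in seg if len(w) == 1))

    return fwd if cost(fwd) < cost(bwd) else bwd


# Illustrative lexicon for a well-known ambiguous string.
LEXICON = {"研究", "研究生", "生命", "命", "的", "起源"}
```

For the string "研究生命的起源", the forward scan yields 研究生 / 命 / 的 / 起源 and the backward scan yields 研究 / 生命 / 的 / 起源. Under the assumed cost function the backward result is kept because it contains fewer single-character words.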
We define 11 kinds of blocks, as listed in
Table 1.
Block  Description                            Example
NP     Noun phrase                            ~¢~
UP     Digital phrase                         14560,.~_=P--_~"
UG     Digital-classifier phrase              .:~.~,~-.p~:
NTL    Phrase expressing the period of time   =.-\[-~t~z, 60 ./x~
NTP    Phrase expressing the exact time       ~ ~}k ~1~.~_~
AP     Adjective phrase                       ~gk:~Y
FP     Adverb phrase                          ~:~l:'~:J4~ (~ :~:i~)
VP     Verb phrase                            -~-~ \]i~
IP     Preposition phrase                     ~l~l/~-
LP     Post-position phrase                   "l~r~
DP     Frame structure                        ~ ..6..~ ~ ~ ~lJ ~i~ Jl~ .~.,,:i~ ~q"
Table 1 Blocks defined in the system
Except for IP, LP and DP, each kind of block is
defined by a set of rules in the form of phrase
structure rules. All of these rules, combined with
syntactic and semantic constraints, are
implemented as an ATN network (Allen, 1995).
We also define 17 kinds of dependency
relations for Chinese, as shown in Table 2.
No.  Relation  Description
1    SUB       Subject (LiB)
2    OBJ1      Indirect object (~t~ ~)
3    OBJ2      Direct object (~i~)
4    COMP      Complement (~b~)
5    NUM       Amount (~5~)
6    TOP       Topic (~J~)
7    ADVN      Near adverbs ( ~| ~:~o~i~)
8    ADVF      Far adverbs (~'l"~.i~. ~'~.~i~. ~~)
9    QT        Miscellaneous before verbs (~Ji*~:~_. ~-~I ~~)
10   HT        Miscellaneous after verbs (Y~Ji,~..~l~J)
11   PUNC      Punctuation mark (~/~gj~-~)
12   PIVT      Pivot (~)
13   SOC       Pivot-complement (~b~)
14   VAA       Series of verbs after (~)~ ~-tJt~V~tJ/(l~)
15   VAB       Series of verbs before (~J~-~ffL~.t~fl ~)
16   G         ~ ~ ~JJ i~q ~t~ fJ~ ~Y~ ~JJ~
17   LOG       Logical relation between sentences (tiE ~. I~:)
Table 2 Dependency relations used in the system
For an input $S = w_1, w_2, \ldots, w_n$, the expected
parse result consists of two parts, as described
below:
(1) $T$: a set of sub-trees, each of which
represents a block.
$T = \{T_1, T_2, T_3, \ldots, T_n\}$
(2) $D$: a set of 3-tuples of the form {governor,
dependant, dependency-relation}, which
represent dependency relations between blocks.
$D = \{\langle gov_1, dep_1, rela_1 \rangle, \langle gov_2, dep_2, rela_2 \rangle, \ldots, \langle gov_m, dep_m, rela_m \rangle\}$
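One possible in-memory representation of this two-part result is sketched below. The class and field names (`Block`, `Dependency`, `ParseResult`, `dependants_of`) are hypothetical, and the system itself is written in C, so this only illustrates the shape of the data. The relation names are the 17 labels of Table 2.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The 17 dependency relation labels of Table 2.
RELATIONS = {
    "SUB", "OBJ1", "OBJ2", "COMP", "NUM", "TOP", "ADVN", "ADVF", "QT",
    "HT", "PUNC", "PIVT", "SOC", "VAA", "VAB", "G", "LOG",
}


@dataclass(frozen=True)
class Block:
    """One sub-tree T_i, reduced here to its label and its words."""
    label: str               # block type from Table 1, e.g. "NP" or "VP"
    words: Tuple[str, ...]


@dataclass(frozen=True)
class Dependency:
    """One 3-tuple <gov, dep, rela>; blocks are referenced by index."""
    governor: int
    dependant: int
    relation: str

    def __post_init__(self) -> None:
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")


@dataclass
class ParseResult:
    blocks: List[Block]              # the set T
    dependencies: List[Dependency]   # the set D

    def dependants_of(self, governor: int) -> List[Tuple[int, str]]:
        """Return (dependant index, relation) pairs for one governor."""
        return [(d.dependant, d.relation)
                for d in self.dependencies if d.governor == governor]


# Placeholder words; block 1 (a VP) governs the two NPs.
result = ParseResult(
    blocks=[Block("NP", ("w1",)), Block("VP", ("w2",)), Block("NP", ("w3",))],
    dependencies=[Dependency(1, 0, "SUB"), Dependency(1, 2, "OBJ2")],
)
```

Named fields are used so that the order of governor and dependant inside a tuple cannot be confused when the result is consumed by later stages.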
Algorithm 1: The block-based parsing
algorithm
1) Identify DP by matching starting words
and ending words;
2) Identify meta-blocks by bottom-up
analysis;
3) Identify level-1 NP, UP, UG, NTL, NTP, AP,
FP and VP by bottom-up analysis;
4) Identify LP and IP by looking for the left
boundary of LP and the right boundary of IP,
using a set of Chinese linguistic rules;
5) Identify level-2 NP, UP, UG, NTL, NTP, AP,
FP and VP by bottom-up analysis;
6) Perform dependency parsing on the blocks
identified;
7) For IP, LP and DP blocks, recursively apply
steps 1 through 6.
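Step 1 of Algorithm 1 can be sketched as a scan over the segmented words. The matching policy used here (nearest ending word, no overlapping frames) is an assumption, since the text only states that starting and ending words are paired. The word pairs in `FRAME_PAIRS` are common Chinese frame constructions chosen for illustration, not the system's actual list.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical starting word -> admissible ending words.
FRAME_PAIRS: Dict[str, Set[str]] = {
    "在": {"上", "中", "里"},
    "当": {"时"},
}


def find_frames(words: List[str],
                pairs: Dict[str, Set[str]]) -> List[Tuple[int, int]]:
    """Return (start, end) index spans of frame structures (DP)."""
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(words):
        endings = pairs.get(words[i])
        if endings:
            # Look for the nearest admissible ending word to the right.
            for j in range(i + 1, len(words)):
                if words[j] in endings:
                    spans.append((i, j))
                    i = j  # resume scanning after the frame
                    break
        i += 1
    return spans
```

For the word list ["他", "在", "桌子", "上", "写", "字"], the sketch marks the span from index 1 to index 3 as a frame. Its internal structure would be analysed later, in the recursive step 7.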
In the following, we will illustrate the parsing 
process with an example. 
Sentence 3: :~51q~Y~;~J'qa/J~J~ 
(1)Word Segmentation & Part-of-speech 
tagging 
/~ v/~ F/~ V/.ff~ N/~ N/~ ~/q~,b 
~-~ FI_i~ AJ. P~ 
(2) Meta-blocks identification 
\[/~ v/:~ F/:~ v/we 
(3) Frame structure identification 
_yz NIJt~~ L/\]DP 
(4) Block identification 
blockl: \[/~.~ V/;T z FI~ V/\]VP 
block2:\[/~a N/~.~ NONP 
block3: \[~ v~/J, A/~_.~ N/~ N/~ V/_~ 
~N/,zJC-~ z N/~t~i~,q L/\]DP 
block4: \[~ F/~ A/lAP 
block5: \[o P/\] 
(5)Predicate Identification 
Block4 is determined as the predicate. 
(6) Dependency parsing 
(block2, blockl, OBJ1) 
(block3, block4, ADVF) 
(block1, block4, SUB) 
(block5, block4, PUNC)
(7)Repeat the above parsing process to 
analyze the internal structure of DP, IP and 
LP 
Analyze block3 recursively (The detailed 
process is omitted). 
Much effort has been devoted to parsing
languages into phrase structure, and many
powerful computational models have been
proposed (Gazdar et al., 1987; Tomita, 1986).
We build an ATN-like network to identify
these blocks. Since ATN approaches are
described in the literature (Allen, 1995), we do
not describe the algorithm in detail here. In the
next section, we focus on a new, efficient
algorithm for Chinese dependency parsing.
3 Dependency analysis 
For an input $S = block_1, block_2, \ldots, block_n$,
the dependency parser generates a set of
3-tuples of the form {governor, dependant,
dependency-relation}, which represent the
dependency relations between blocks in the
given sentence:
$\{\langle gov_1, dep_1, rela_1 \rangle, \langle gov_2, dep_2, rela_2 \rangle, \ldots, \langle gov_m, dep_m, rela_m \rangle\}$
Algorithm 2: The dependency parsing
1) Count the number of blocks qualified to
act as a predicate, denoted as s. Such
blocks are called "predicate candidates".
2) Select the predicate from these s blocks,
denoted as $block_j$.
3) If s = 0, return; // no analysis is needed
4) For each case s = 1, 2, ... (s > 0), perform
the corresponding dependency parsing.
A sentence may contain s predicate candidates.
For each case, we define a detailed analysis
algorithm. At present, the parser is designed to
handle sentences containing up to 7 predicate
candidates. If a sentence has more than 7
predicate candidates, it is partitioned into two
parts, which are then analyzed in turn.
Suppose the predicate block is $block_j$ and the
number of predicate candidates is s.
We explain dependency parsing through the
following two simple cases.
Case 1: s = 1
• For every $block_k$ before $block_j$, build
dependency relations of the form
($block_k$, $block_j$, SUB), ($block_k$, $block_j$, ADV),
($block_k$, $block_j$, G), ($block_k$, $block_j$, TOP);
• For every $block_k$ after $block_j$, build
dependency relations of the form
($block_k$, $block_j$, COMP), ($block_k$, $block_j$, OBJ1),
($block_k$, $block_j$, OBJ2), etc.
Case 2: s = 2; let the other predicate
candidate be $block_i$
• For every $block_k$ before $block_j$, build
dependency relations of the form
($block_k$, $block_j$, SUB), ($block_k$, $block_j$, ADV),
($block_k$, $block_j$, G), ($block_k$, $block_j$, TOP);
• For every $block_k$ after $block_j$, build
dependency relations of the form
($block_k$, $block_j$, COMP),
($block_k$, $block_j$, OBJ1),
($block_k$, $block_j$, OBJ2), etc.
• For the blocks between $block_i$ and $block_j$,
conduct a detailed analysis based on the
verb categories of $block_i$ and $block_j$;
• Determine the dependency relation
between $block_i$ and $block_j$.
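Case 1 can be sketched as follows. The candidate relation lists come from the description above (the "etc." in the list for blocks after the predicate is not expanded). How the parser picks one relation for a given block is not specified, so the `choose` callback and the `toy_choose` preference table are assumptions. Blocks for which no relation is chosen are skipped, in the spirit of the skip strategy for ungrammatical portions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

BEFORE_PREDICATE = ("SUB", "ADV", "G", "TOP")
AFTER_PREDICATE = ("COMP", "OBJ1", "OBJ2")

# (index of block_k, index of block_j, relation), as written in Case 1.
Triple = Tuple[int, int, str]
Chooser = Callable[[str, Sequence[str]], Optional[str]]


def parse_single_predicate(labels: List[str], pred: int,
                           choose: Chooser) -> List[Triple]:
    """Attach every other block to the single predicate block."""
    triples: List[Triple] = []
    for k, label in enumerate(labels):
        if k == pred:
            continue
        candidates = BEFORE_PREDICATE if k < pred else AFTER_PREDICATE
        relation = choose(label, candidates)
        if relation is not None:  # None: skip the block instead of failing
            triples.append((k, pred, relation))
    return triples


def toy_choose(label: str, candidates: Sequence[str]) -> Optional[str]:
    """Hypothetical preference table keyed by block type."""
    preferred = {"NP": ("SUB", "OBJ2"), "FP": ("ADV",), "AP": ("COMP",)}
    for relation in preferred.get(label, ()):
        if relation in candidates:
            return relation
    return None
```

With block labels ["NP", "FP", "VP", "NP", "XX"] and the predicate at index 2, the sketch attaches the first NP as SUB, the FP as ADV and the second NP as OBJ2, and it leaves the unknown block "XX" unattached.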
4 Experiments 
A parsing system was implemented and
extensive experiments were performed.
The system is written in C and was tested on a
Pentium PC. A total of over 1,000 phrase
structure rules and over 300 dependency rules
were used for block-based parsing. We built a
large lexicon of 220,000 word entries, with word
category information and the necessary syntactic
and semantic features. This approach has been
incorporated as the Chinese parsing model of a
successful commercial Chinese-Japanese
machine translation system, J-Beijing (Zhou,
1999).
The system accepts Chinese text and outputs
the parsing result for each sentence. Each input
sentence is defined as a word string ending with
a period, comma, question mark, semicolon, or
exclamation mark.
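This definition of an input sentence can be sketched as a simple splitter. The use of full-width Chinese punctuation marks and the handling of trailing text without a terminator are assumptions, and the sample string is illustrative.

```python
from typing import List

# Full-width period, comma, question mark, semicolon, exclamation mark.
TERMINATORS = "。，？；！"


def split_units(text: str) -> List[str]:
    """Split text into parsing units, each ending with a terminator."""
    units: List[str] = []
    current: List[str] = []
    for ch in text:
        current.append(ch)
        if ch in TERMINATORS:
            unit = "".join(current).strip()
            if unit:
                units.append(unit)
            current = []
    # Assumption: trailing text without a terminator is kept as a final unit.
    tail = "".join(current).strip()
    if tail:
        units.append(tail)
    return units
```

Because the comma also ends a unit, a long written sentence is handed to the parser clause by clause: "今天下雨，我们不去了。你呢？" becomes three units.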
We evaluated the parsing results on two
corpora: (1) the "primary school textbook of
Singapore" (~:~l~J~:Jx~), a corpus of single
sentences of modern Chinese comprising
1842 sentences, which not only covers
most Chinese sentence types but also includes
various morphological phenomena, such as
word duplication, affixes and suffixes; (2) news
articles collected from the People's
Daily (1998, 1999, 2000). These sentences are real
text, so they contain many unknown words (mainly
proper nouns), long sentences, complicated
sentences, ellipsis, etc. The evaluation results are
listed in Table 3.
Test corpus                          #Sentences  Average sentence length (words)  Analysis precision
Primary school textbook, Singapore   1842        7.34                             90.4%
People's Daily                       1400        14.52                            67.7%
Table 3 Evaluation result
Although this model has produced
satisfactory initial results, some inherent
difficulties of the Chinese language remain,
and further improvement is highly desirable.
Through error analysis, we found several
main issues that seriously affect system
performance, as listed below.
1) Word segmentation
/~/,~fgJ~/_k~/o / 
/~.~,~/~/~/_1=~/o / 
"~l~" can function not only as a single word, but
also as two words with a totally different
meaning.
2) Part-of-speech tagging 
(1,~,R) (2,~,I) (3,/0IT,R) (4,~J,I) (5.J~.N) (6,~1~ 
~'~.-~,N) (7,~)~,E) (8,fF,V) (9,I~J,E) (10,4~7~,F) (H, 
~.7~.F) (12,~,V,P) (13,~,N) (14,~,V,C) 
(15. ,P) 
3) Compound noun 
Since compound nouns cannot be exhaustively
enumerated, errors are inevitable.
4) Identification of proper noun 
(1,~TzI~J~\],N) (2,~,V,P) C3,~.A) :4...-~.,N) (5,~i 
~,v,c) (6,~,~ (7,~,~ (8,,~,~,~ 
5) Syntactical ambiguity 
(l,~i~.,V,P) (2,an,N) (3,\]~I-----~,N) (4,~{~;,N) (5. 
//~.V.C) (6jz-~.A.D) (6,'~)~,N) (7,~-~,A) (8,0(f.-~,N) 
(9,~-~,A) (IO,~,N) (11,o ,P) 
For the pattern "V+A+N", there are usually
two possible reductions:
[[V+A]vp+N] ~.A~/~
[V+[A+N]np] ~j~z~
All of these problems need further 
improvements in the future. 
Conclusion 
In this paper, a practical Chinese parser is
presented. The block-based dependency parsing
strategy is a novel integration of the phrase
structure partial parsing approach and the
dependency parsing approach. Together, these
two approaches can cope with
ungrammatical, faulty, or complicated
sentences, making the system highly
robust. Furthermore, our top-down strategy of
identifying special Chinese structures such
as frame structures, preposition structures and
post-position structures produces a simplified
sentence skeleton, thereby improving the
efficiency of parsing.
Although this model has shown satisfactory
initial results, some inherent difficulties of the
Chinese language remain, and further work
is needed. We currently determine the word
category with a set of manually compiled
linguistic rules, which limits the precision of
word category identification. Therefore, other
approaches, such as a statistical approach or some
kind of hybrid approach, will be adopted in the
future. In addition, new methods for handling
ambiguous word segmentation, proper noun and
compound noun identification, block analysis,
predicate identification and dependency analysis
will be studied.
Acknowledgements 
Our thanks go to Dr. Kai-Fu Lee and Prof.
Changning Huang of Microsoft Research China
for their valuable suggestions. We also thank all
the members of the Chinese-Japanese MT group of
Kodensha for their great efforts in testing the
parsing system and improving the dictionary.

References 
Gazdar, G., Franz, A., Osborne, K., and Evans, R.
(1987), Natural Language Processing in the
1980s, CSLI, Stanford University.
Tomita, M. (1986). Efficient Parsing for Natural 
Language: A Fast Algorithm for Practical Systems, 
Boston: Kluwer. 
Jinye Zhou, Shi-kuo Chang (1986), A Methodology 
for Deterministic Chinese Parsing, Computer 
Processing of Chinese & Oriental Languages, Vol. 
2, No. 3 May 1986. 
Lin-Shan Lee, Lee-Feng Chien, Longj-ji Lin, James 
Huang, K.-J. Chen (1991), An Efficient Natural 
Language Processing System Specially Designed 
for the Chinese Language, Computational 
Linguistics, Vol.17, No. 4, 1991 
M. Zhou (1999), J-Beijing Chinese-Japanese 
Machine Translation System, Proceedings of JSCL, 
312-319, Beijing, 1-3, Nov, 1999 
Jingcun Wu, Xuechao Hou (1982), Modern Chinese
Syntactical Analysis, Beijing University Press.
Zhengsheng Luo, Changiian Sun, Cai Sun (1995), An 
Approach to the Recognition of predicated in the 
automatic analysis of Chinese sentence patterns, 
Advances and applications on Computational 
Linguistics, Tsinghua University Press 
Chao, Y.R. (1968). A Grammar of Spoken Chinese,
Berkeley, CA: University of California Press.
M. Zhou, C.N. Huang (1994), An Efficient Syntactic
Tagging Tool for Corpora, Proc. COLING 94,
Kyoto, pp. 945-955.
Huang, J. (1982). Logical relations in Chinese and 
the theory of grammar, Doctoral dissertation, 
Massachusetts Institute of Technology, Cambridge, 
MA. 
