Subcategorization Acquisition and Evaluation for Chinese Verbs 
Xiwu Han, Tiejun Zhao, Haoliang Qi, Hao Yu 
Department of Computer Science,  
Harbin Institute of Technology, 150001 Harbin, China 
{hxw, tjzhao, qhl, yh}@mtlab.hit.edu.cn 
 
Abstract 
This paper describes the technology and an ex-
periment of subcategorization acquisition for 
Chinese verbs. The SCF hypotheses are gener-
ated by means of linguistic heuristic information 
and filtered via statistical methods. Evaluation 
on the acquisition of 20 multi-pattern verbs 
shows that our experiment achieved the similar 
precision and recall with former researches. Be-
sides, simple application of the acquired lexicon 
to a PCFG parser indicates great potentialities of 
subcategorization information in the fields of 
NLP. 
Credits 
This research is sponsored by National Natural 
Science Foundation (Grant No. 60373101 and 
603750 19), and High-Tech Research and Devel-
opment Program (Grant No. 2002AA117010-09). 
Introduction 
Since (Brent 1991) there have been a consider-
able amount of researches focusing on verb lexi-
cons with respective subcategorization informa-
tion specified both in the field of traditional lin-
guistics and that of computational linguistics. As 
for the former, subcategory theories illustrating 
the syntactic behaviors of verbal predicates are 
now much more systemically improved, e.g. 
(Korhonen 2001). And for auto-acquisition and 
relevant application, researchers have made great 
achievements not only in English, e.g. (Briscoe 
and Carroll 1997), (Korhonen 2003), but also in 
many other languages, such as Germany (Schulte 
im Walde 2002), Czech (Sarkar and Zeman 
2000), and Portuguese (Gamallo et. al 2002). 
However, relevant theoretical researches on 
Chinese verbs are generally limited to case gram-
mar, valency, some semantic computation theo-
ries, and a few papers on manual acquisition or 
prescriptive designment of syntactic patterns. 
Due to irrelevant initial motivations, syntactic 
and semantic generalizabilities of the consequent 
outputs are not in such a harmony that satisfies 
the description granularity for SCF (Han and 
Zhao 2004). The only auto-acquisition work for 
Chinese SCF made by (Han and Zhao 2004) de-
scribes the predefinition of 152 general frames 
for all verbs in Chinese, but that experiment is 
not based on real corpus. After observing and 
analyzing quantity of subcategory phenomena in 
real Chinese corpus in the People’s Daily 
(Jan.~June, 1998), we removed from Han & 
Zhao’s predefinition 15 SCFs that are actually 
similar derivants of others, and then with this 
foundation and linguistic rules from (Zhao 2002) 
as heuristic information we generated SCF hy-
potheses from the corpus of People’s Daily 
(Jan.~June, 1998), and statistically filtered the 
hypotheses into a Chinese verb SCF lexicon. As 
far as we know, this is the first attempt of Chi-
nese SCF auto-acquisition based on real corpus. 
In the rest of this paper, the second section de-
scribes a comprehensive system that builds verb 
SCF lexicons from large real corpus, the respec-
tive operating principles, and the knowledge 
coded in our SCF. The third section analyzed the 
acquired lexicon with two experiments: one 
evaluated the acquisition results of 20 verbs with 
multi syntactic patterns against manual gold 
standard; the other checked the performance of 
the lexicon when applied in a PCFG parser. The 
forth section compares and contrasts this research 
with related works done by others. And at last, 
Section 5 concludes our present achievements, 
disadvantages and possible future focuses. 
 
1   SCF Acquisition 
1.1 The Acquisition Method 
There are generally 4 steps in the process of our 
auto-acquisition experiment. First, the corpus is 
processed with a cascaded HMM parser; second, 
