Improving English Subcategorization Acquisition with Diathesis Alternations as Heuristic Information
Xiwu Han
Institute of Computational Linguistics
Heilongjiang University
Harbin City 150080, China
hxw@hlju.edu.cn

Tiejun Zhao
School of Computer Science and Technology
Harbin Institute of Technology
Harbin City 150001, China
tjzhao@mtlab.hit.edu.cn

Xingshang Fu
Institute of Computational Linguistics
Heilongjiang University
Harbin City 150080, China
fxs@hlju.edu.cn
 
 
 
Abstract 
Automatically acquired lexicons with subcategorization information have already proved accurate and useful enough for some purposes, but their accuracy still leaves room for improvement. By means of diathesis alternation, this paper proposes a new filtering method, which markedly improved the performance of Korhonen's acquisition system, raising precision to 91.18% with recall unchanged and making the acquired lexicon much more practical for further manual proofreading and other NLP uses.
1 Introduction 
Subcategorization is the process that further classifies a syntactic category into its subsets. Chomsky (1965) defines the function of strict subcategorization features as specifying a set of constraints that govern the selection of verbs and other arguments in deep structure. For example, the verb give subcategorizes for the frames [NP V NP NP] ("gave him a book") and [NP V NP PP] ("gave a book to him"). Large subcategorized verbal lexicons have proved to be crucially important for many natural language processing tasks, such as probabilistic parsing (Korhonen, 2001, 2002) and verb classification (Schulte im Walde, 2002; Korhonen, 2003).
Since Brent (1993), a considerable amount of research on large-scale automatic acquisition of subcategorization frames (SCFs) has met with some success, not only in English but also in many other languages, including German (Schulte im Walde, 2002), Spanish (Chrupala, 2003), Czech (Sarkar and Zeman, 2000), Portuguese (Gamallo et al., 2002), and Chinese (Han et al., 2004). The general objective of this research is to acquire from a given corpus the SCF types, and their frequencies, for predicate verbs. Two typical steps in the automatic acquisition process are hypothesis generation and hypothesis selection. The first step, usually based on heuristic rules, generates SCF hypotheses for the verbs involved; the second selects reliable ones via statistical methods such as BHT (binomial hypothesis testing), LLR (log likelihood ratio) and MLE (maximum likelihood estimation). This second step is also called statistical filtering and has been widely regarded as problematic. For English, researchers have proposed methods that adjust the corpus hypothesis frequencies before or during filtering. These methods are often called backoff techniques for SCF acquisition. Some of them yield a remarkable improvement in acquisition performance, for example diathesis alternation and semantic motivation (Korhonen, 1998, 2001, 2002).
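To make the filtering step concrete, the following is a minimal sketch of a Brent-style binomial hypothesis test (our own reconstruction for illustration only; the error probability p_e and significance level alpha are assumed values, not parameters reported in the cited work):

    from math import comb

    def bht_accept(m, n, p_e=0.05, alpha=0.05):
        # Brent-style binomial hypothesis test (illustrative sketch).
        # m: number of times frame scf_i was hypothesized for verb v
        # n: total corpus occurrences of v
        # Null hypothesis: v does not take scf_i, so each of the m
        # hypotheses is an error occurring with probability p_e.
        # Accept the frame only if seeing m or more errors in n
        # trials would be unlikely under the null hypothesis.
        p_null = sum(comb(n, k) * p_e ** k * (1 - p_e) ** (n - k)
                     for k in range(m, n + 1))
        return p_null < alpha

LLR and MLE filtering substitute a different test statistic but keep the same accept-or-reject decision structure over the generated hypotheses.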
To allow convenient comparison between the performances of different SCF acquisition methods, we define absolute and relative recall in this paper. By absolute recall, we mean recall computed against the gold standard for the whole input corpus, while relative recall is computed against the set of generated hypotheses.
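Concretely, in our formulation:

    absolute recall = |true positives| / |gold-standard SCF types for the corpus|
    relative recall = |true positives| / |gold-standard SCF types among the generated hypotheses|

so absolute recall equals relative recall multiplied by the hypothesis generator's coverage of the gold standard. With the figures cited below (71.2% absolute and 85.27% relative recall), the generator itself recovers roughly 71.2 / 85.27 ≈ 83.5% of the gold-standard frames.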
At present, automatically acquired verb lexicons with SCF information have already proved accurate and useful enough for some NLP purposes (Korhonen, 2001; Han et al., 2004). For English, Korhonen (2002) reported that semantically motivated SCF acquisition achieved a precision of 87.1%, an absolute recall of 71.2% and a relative recall of 85.27%, making the acquired lexicon much more accurate and useful. However, the accuracy still leaves room for improvement, especially for SCF hypotheses with low frequencies. Detailed analysis of the acquisition system and some of its output shows that three main causes account for the comparatively unsatisfactory performance: a. the imperfect hypothesis generator, b. the Zipfian distribution of syntactic patterns, and c. the incomplete partition over the SCF types of a given verb. The first problem mainly stems from inadequate parsing performance and noise in the corpus, while the other two are inherent to natural language and should be addressed by the acquisition technique itself, particularly during hypothesis selection.
2 Related Work 
The empirical background of this paper is the public resource for subcategorization acquisition of English verbs provided by Anna Korhonen (2005) on her personal home page. The data include 30 verbs, as shown in Table 1; their unfiltered SCF hypotheses, automatically generated by Briscoe and Carroll's (1997) SCF acquisition system; and the manually established gold standard.
Table 1. English Verbs in Use. 
add agree attach 
bring carry carve 
chop cling clip 
fly  cut travel 
drag communicate give 
lend lock marry 
meet mix move 
offer provide visit 
push sail send 
slice supply swing 
For each verb, there is a corpus of 1000 sentences extracted from the BNC, and altogether 42 SCF types are involved in the corpus. The framework of Briscoe and Carroll's system consists of six components, which are applied in sequence to sentences containing a specific predicate in order to retrieve a set of SCFs for that verb:

• A tagger, a first-order Hidden Markov Model POS and punctuation tag disambiguator.
• A lemmatizer, an enhanced version of the General Architecture for Text Engineering project stemmer.
• A probabilistic LR parser, trained on a treebank derived semi-automatically from the SUSANNE corpus, which returns ranked analyses using a feature-based unification grammar.
• A pattern extractor, which extracts subcategorization patterns, i.e. local syntactic frames including the syntactic categories and head lemmas.
• A pattern classifier, which assigns patterns to SCFs or rejects them as unclassifiable.
• An SCF filter, which evaluates the sets of SCFs gathered for a predicate verb.
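Purely as an illustration of how these components chain together (all stage names below are hypothetical and are passed in as functions; they are not Briscoe and Carroll's actual interfaces), the architecture can be sketched as:

    from typing import Callable, Iterable, List

    def acquire_scfs(sentences: Iterable[str],
                     tag: Callable, lemmatize: Callable, parse: Callable,
                     extract: Callable, classify: Callable,
                     scf_filter: Callable) -> List[str]:
        # Illustrative chaining of the six components described above.
        hypotheses: List[str] = []
        for sentence in sentences:
            analysis = parse(lemmatize(tag(sentence)))  # stages 1-3
            for pattern in extract(analysis):           # stage 4
                scf = classify(pattern)                 # stage 5: SCF or None
                if scf is not None:
                    hypotheses.append(scf)
        return scf_filter(hypotheses)                   # stage 6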
Nowadays, in most related research, the performance of a subcategorization acquisition system is evaluated in terms of the precision, recall and F-measure of SCF types (Korhonen, 2001, 2002). Generally, precision is the percentage of SCFs that the system proposes correctly, while recall is the percentage of SCFs in the gold standard that the system proposes:

    Precision = |true positives| / (|true positives| + |false positives|)

    Recall = |true positives| / (|true positives| + |false negatives|)

    F-measure = 2 × Precision × Recall / (Precision + Recall)
Here, true positives are correct SCF types proposed by the system, false positives are incorrect SCF types proposed by the system, and false negatives are correct SCF types not proposed by the system.
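As a minimal illustration (our own sketch, not part of the original evaluation code), these type-based measures can be computed directly from the proposed and gold-standard SCF type sets for a verb:

    def evaluate(proposed: set, gold: set):
        # Type-based precision, recall and F-measure for one verb.
        tp = len(proposed & gold)    # correct SCF types proposed
        fp = len(proposed - gold)    # incorrect SCF types proposed
        fn = len(gold - proposed)    # correct SCF types not proposed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        return precision, recall, f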
3 The MLE Filtering Method 
The present SCF acquisition system for English verbs employs an MLE filter to test the automatically generated SCF hypotheses. Due to the noise accumulated while tagging, lemmatizing and parsing the corpus, the hypothesis generator does not perform as efficiently as hoped, even though corrections are made for some typical errors when the extracted patterns are classified. Sampling analysis of the unfiltered hypotheses in Korhonen's evaluation corpus indicates that about 74% of the incorrectly proposed or rejected SCF types result from defects of the MLE filtering method.
Performance of the MLE filter is closely related to the actual distributions p(scf_i|v) over predicates and SCF types in the input corpus. First, from the overall corpus a training set is drawn randomly; it must be large enough to ensure a similar distribution. Then, the frequency of a subcategorization frame scf_i occurring with a verb v is recorded and used to estimate the probability p(scf_i|v). Thirdly, an empirical threshold θ is determined, which ensures that a maximum
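The estimation and filtering procedure described in this section can be sketched as follows (a minimal illustration under our own assumptions; the threshold value theta is invented for the example, not taken from the system):

    from collections import Counter

    def mle_filter(hypotheses, theta=0.02):
        # Reject SCF hypotheses whose relative frequency with the verb
        # falls below the empirical threshold theta (illustrative value).
        # hypotheses: list of SCF type labels generated for one verb.
        counts = Counter(hypotheses)
        total = sum(counts.values())
        if total == 0:
            return set()
        # p(scf_i | v) estimated by maximum likelihood: count / total
        return {scf for scf, c in counts.items() if c / total >= theta}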
