Automatic extraction of subcorpora based on subcategorization 
frames from a part-of-speech tagged corpus 
Susanne GAHL 
UC Berkeley, Department of Linguistics 
ICSI 
1947 Center St, Suite 600 
Berkeley, CA 94704-1105 
gahl@icsi.berkeley.edu 
Abstract 
This paper presents a method for extracting
subcorpora documenting different subcategorization
frames for verbs, nouns, and
adjectives in the 100-million-word British
National Corpus. The extraction tool consists
of a set of batch files for use with the Corpus 
Query Processor (CQP), which is part of the 
IMS corpus workbench (cf. Christ 1994a,b). 
A macroprocessor has been developed that 
allows the user to specify in a simple input file 
which subcorpora are to be created for a given 
lemma. 
The resulting subcorpora can be used (1) to 
provide evidence for the subcategorization 
properties of a given lemma, and to facilitate 
the selection of corpus lines for lexicographic 
research, and (2) to determine the frequencies 
of different syntactic contexts of each lemma. 
Introduction 
A number of resources are available for 
obtaining subcategorization information, i.e. 
information on the types of syntactic 
complements associated with valence-bearing 
predicators (which include verbs, nouns, and 
adjectives). This information, also referred to
as valence information, is available both in
machine-readable form, as in the COMLEX 
database (Macleod et al. 1995), and in human- 
readable dictionaries (e.g. Hornby 1989, 
Procter 1978, Sinclair 1987). Increasingly, 
tools are also becoming available for acquiring 
subcategorization information from corpora, 
i.e. for inferring the subcategorization frames 
of a given lemma (e.g. Manning 1993). 
None of these resources provide immediate 
access to corpus evidence, nor do they provide 
information on the relative frequency of the 
patterns that are listed for a given lemma. 
There is a need for a tool that can (1) find 
evidence for subcategorization patterns and 
(2) determine their frequencies in large 
corpora: 
1. Statistical approaches to NLP rely on 
information not just on the range of 
combinatory possibilities of words, but 
also the relative frequencies of the 
expected patterns. 
2. Dictionaries that list subcategorization 
frames often list expected patterns, rather 
than actual ones. Lexicographers and
lexicologists need access to the evidence
for this information.
3. Frequency information has come to be 
the focus of much psycholinguistic 
research on sentence processing (see for 
example MacDonald 1997). While 
information on word frequency is readily 
available (e.g. Francis and Kucera 
(1982)), there is as yet no easy way of 
obtaining information from large corpora 
on the relative frequency of complemen- 
tation patterns. 
None of these points argue against the use- 
fulness of the available resources, but they 
show that there is a gap in the available in- 
formation. 
To address this need, we have developed a tool 
for extracting evidence for subcategorization 
patterns from the 100-million-word British
National Corpus (BNC). The tool is used as
part of the lexicon-building process in the
FrameNet project, an NSF-funded project 
aimed at creating a lexical database based on 
the principles of Frame Semantics (Fillmore 
1982). 
1 Infrastructure 
1.1 Tools 
We are using the 100-million-word British
National Corpus, with the following corpus 
query tools: 
• CQP (Corpus Query Processor, Christ 
(1994)), a general corpus query processor 
for complex queries with any number and 
combination of annotated information 
types, including part-of-speech tags, 
morphosyntactic tags, lemmas and 
sentence boundaries. 
• A macroprocessor for use with CQP that 
allows the user to specify which 
subcorpora are to be created for a given 
lemma. 
The corpus queries are written in the CQP 
corpus query language, which uses regular 
expressions over part-of-speech tags, lemmas, 
morphosyntactic tags, and sentence 
boundaries. For details, see Christ (1994a). 
The queries essentially simulate a chunk 
parser, using a regular grammar. 
1.2 Coverage 
A list of the verb frames that are currently 
searchable is given in figure 1 below, along 
with an example of each pattern. The 
categories we are using are roughly based on 
those used in the COMLEX syntactic 
dictionary (Macleod et al. 1995). 
intransitive    'worms wiggle'
np              'kiss me'
np_np           'brought her flowers'
np_pp           'replaced it with a new one'
np_Pvping       'prevented him from leaving'
np_pwh          'asked her about what it all meant'
np_vpto         'advised her to go'
np_vping        'kept them laughing'
np_sfin         'told them (that) he was back'
np_wh           'asked him where the money was'
np_ap           'considered him foolish'
np_sbrst        'had him clean up'
ap              'turned blue'
figure 1: searchable complement types for verbs
In our queries for nouns and adjectives as 
targets, we are able to extract prepositional, 
clausal, infinitival, and gerundial complements. 
In addition, the tool accommodates searches for
compounds and for possessor phrases (my 
neighbor's addiction to cake, my milk allergy). 
Even though these categories are not tied to 
the syntactic subcategorization frames of the 
target lemmas, they often instantiate semantic 
arguments, or, more specifically, Frame 
elements (Fillmore 1982, Baker et al. 
forthcoming). 
1.3 Method 
1.3.1 Overview 
We start by creating a subcorpus containing all 
concordance lines for a given lemma. We call 
this subcorpus a lemma-subcorpus. The 
extraction of smaller subcorpora from the 
lemma subcorpus then proceeds in two stages. 
During the first stage, syntactic patterns 
involving 'displaced' arguments (i.e. 'left 
isolation' or 'movement' phenomena) are 
extracted, such as passives, tough movement 
and constructions involving WH-extraction. 
The result of this procedure is a set of 
subcorpora that are homogeneous with respect 
to major constituent order. Following this, the 
remainder of the lemma-subcorpus is 
partitioned into subcorpora based on the 
subcategorization properties of the lemma in 
question. 
PP              'look at the picture'
PP-PP           'turned from a frog into a prince'
Pvping          'responded by nodding her head'
Pwh             'wonder about how it happened'
intrans, part.  'touch down', 'turn over'
np_particle     'put the dishes away', 'put away the dishes'
particle_pp     'run off with it'
particle_wh     'figured out how to get there'
vping           'needs fixing'
sfin            'claimed (that) it was over'
sbrst           'demanded (that) he leave'
vpto            'agreed to do it over'
directquote     'no, said he', '"no", he said', 'he said: "no"'
adverb          'behave badly'
figure 1 (continued): searchable complement types for verbs
1.3.2 Search strategies: positive and negative queries 
For the extraction of certain subcategorization 
patterns, it is not necessary to simulate a parse 
of all of the constituents. Where an explicit 
context cue exists, a partial parse suffices. For
example, the query given in figure 2 below is
used to find [_ NP VPing] patterns (e.g. kept
them laughing). Note that the query does not
positively identify a noun phrase in the
position following the verb.

encoding                                     description         example
[$search_by]                                 target lemma        kept
[pos!="V.*|CJC|CJS|CJT|PRF|PRP|PUN"]{1,5}                        them
[pos="VVG|VBG|VDG|VHG"]                      gerund              coming
within s;                                    within a sentence

figure 2: A query for [NP VPing]
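
To make the logic of this partial parse concrete, the following is a minimal Python sketch of the same pattern applied to an already tagged sentence. It is an illustration outside CQP, not the tool itself: the function name match_np_vping, the (word, lemma, pos) token representation and the max_gap parameter are assumptions of this sketch. Only the tag sets are taken from figure 2.

```python
import re

# Tags that may not occur between the target verb and the gerund
# (verbs, conjunctions, prepositions, punctuation), as in figure 2.
BLOCKED = re.compile(r"V.*|CJC|CJS|CJT|PRF|PRP|PUN")
GERUND = {"VVG", "VBG", "VDG", "VHG"}


def match_np_vping(sentence, target_lemma, max_gap=5):
    """Return (target_index, gerund_index) pairs for one tagged sentence.

    `sentence` is a list of (word, lemma, pos) triples covering exactly one
    sentence, so the 'within s' restriction holds by construction.
    """
    hits = []
    for i, (_, lemma, _) in enumerate(sentence):
        if lemma != target_lemma:
            continue
        # Allow between 1 and max_gap intervening tokens.
        for gap in range(1, max_gap + 1):
            j = i + 1 + gap
            if j >= len(sentence):
                break
            between = sentence[i + 1:j]
            if any(BLOCKED.fullmatch(pos) for _, _, pos in between):
                break  # every longer gap would contain the same blocked token
            if sentence[j][2] in GERUND:
                hits.append((i, j))
                break
    return hits


example = [
    ("She", "she", "PNP"),
    ("kept", "keep", "VVD"),
    ("them", "they", "PNP"),
    ("laughing", "laugh", "VVG"),
    (".", ".", "PUN"),
]
print(match_np_vping(example, "keep"))
```

As in the CQP query, the intervening material is never checked for being a well-formed noun phrase; it is only required not to contain a tag that would rule one out.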
1.3.3 Searches driven by subcategorization frames
Applying queries like the one for [NP VPing]
"blindly", i.e. in the absence of any 
information on the target lemma, would 
produce many false hits, since the query also 
matches gerunds that are not subcategorized. 
However, the information that the target verb 
subcategorizes for a gerund dramatically 
reduces the number of such errors. 
The same mechanism is used for addressing 
the problems associated with prepositional 
phrase attachment. The general principle is 
that prepositional phrases in certain contexts 
are considered to be embedded in a preceding 
noun phrase, unless the user specifies that a
given preposition is subcategorized for by the
target lemma. For example, the of-phrase in a
sequence Verb - NP - of - NP is interpreted as
part of the first NP (as in met the president of
the company), unless we are dealing with a
verb that has a [_NP PPof] subcategorization
frame, e.g. cured the president of his asthma.
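
The attachment principle just described amounts to a default with a lexically specified override. A small Python sketch of that decision is given here for illustration. The names attach_pp, bracket and NP_PP_FRAMES, and the toy frame entries, are hypothetical; in the tool itself the decision is encoded in the CQP queries and driven by the input file.

```python
# Hypothetical frame entries: prepositions the user has listed for the
# [_NP PP] frame of each lemma. Lemmas without an entry get the default.
NP_PP_FRAMES = {"cure": {"of"}, "meet": set()}


def attach_pp(preposition, subcategorized_preps):
    """Classify the PP in a Verb - NP - P - NP sequence.

    Returns "complement" if the preposition is listed for the target lemma,
    and "np_internal" (embedded in the preceding NP) otherwise.
    """
    if preposition.lower() in subcategorized_preps:
        return "complement"
    return "np_internal"


def bracket(lemma, verb, np1, prep, np2, frames=NP_PP_FRAMES):
    """Render the bracketing that results from the attachment decision."""
    if attach_pp(prep, frames.get(lemma, set())) == "complement":
        return f"{verb} [{np1}] [{prep} {np2}]"
    return f"{verb} [{np1} {prep} {np2}]"


print(bracket("cure", "cured", "the president", "of", "his asthma"))
print(bracket("meet", "met", "the president", "of", "the company"))
```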
1.3.4 Cascading queries 
The result of each query is subtracted from the 
lemma subcorpus and the remainder submitted 
to the next set of queries. As a result, earlier 
queries pre-empt later queries. For example, 
concordance lines matching the queries for 
passives, e.g. he was cured of his asthma, are
filtered out early on in the process, so as to 
avoid getting matched by the queries dealing 
with (active intransitive) verb + prepositional 
phrase complements, such as he boasted of his 
achievements. 
Another example of this type of preemption 
concerns the interaction of the query for 
ditransitive frames (brought her flowers) with 
later queries for NP complements. A proper 
name immediately followed by another 
proper name (e.g. Henry James) is interpreted 
as a single noun phrase except when the target 
lemma subcategorizes for a ditransitive frame 1.
An analogous strategy is used for identifying 
noun compounds. For ditransitives, strings that 
represent two consecutive noun phrases are 
queried for first. Note that this method 
crucially relies on the fact that the 
subcategorization properties of the target 
lemma are given as the input to the query 
process. 
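
The cascade can be pictured as repeated subtraction from the lemma subcorpus. The Python sketch below illustrates the principle on plain strings. The functions cascade and relative_frequencies and the two toy regular expressions are assumptions made for this illustration; they stand in for the CQP queries and are far cruder than the real ones.

```python
import re


def cascade(lemma_subcorpus, queries):
    """Partition concordance lines by applying queries in order.

    `queries` is an ordered list of (name, predicate) pairs. Each line goes
    to the first query that matches it and is removed from the remainder,
    so earlier queries pre-empt later ones.
    """
    remainder = list(lemma_subcorpus)
    subcorpora = {}
    for name, predicate in queries:
        subcorpora[name] = [line for line in remainder if predicate(line)]
        remainder = [line for line in remainder if not predicate(line)]
    subcorpora["remainder"] = remainder
    return subcorpora


def relative_frequencies(subcorpora):
    """Share of the lemma subcorpus that falls into each subcorpus."""
    total = sum(len(lines) for lines in subcorpora.values())
    if total == 0:
        return {}
    return {name: len(lines) / total for name, lines in subcorpora.items()}


# Toy stand-ins for the real queries (assumptions of this sketch).
PASSIVE = re.compile(r"\b(?:is|was|were|be|been|being)\s+\w+ed\s+of\b")
VERB_OF = re.compile(r"\b\w+ed\s+of\b")

QUERIES = [
    ("passive", lambda line: PASSIVE.search(line) is not None),
    ("pp_of", lambda line: VERB_OF.search(line) is not None),
]

LINES = [
    "he was cured of his asthma",
    "he boasted of his achievements",
    "he slept",
]

result = cascade(LINES, QUERIES)
print(result)
print(relative_frequencies(result))
```

Note that the passive line would also satisfy the later verb + of query; it is kept out of that subcorpus only because the passive query runs first.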
2 Examples 
2.1 NPs 
An example of a complex query expression of 
the kind we are using is given in figure 3. The 
expression matches noun phrases like "the 
three kittens", "poor Mr. Smith", "all three", 
"blue flowers", "an unusually large hat", etc. 
([pos = "AT0|CRD|DPS|DT0|ORD|CJT-DT0|CRD-PNI"]*
 [pos = "AV0|AJ0-AV0"]*
 [pos = "AJ0|AJC|AJS|AJ0-AV0|AJ0-NN1|AJ0-VVG"]*
 [pos = "NN0|NN1|NN2|AJ0-NN1|NN1-NP0|NN1-VVB|NN1-VVG|NN2-VVZ"])
| ([pos = "AT0|CRD|DPS|DT0|ORD|CJT-DT0|CRD-PNI"]+
 [pos = "AV0|AJ0-AV0"]*
 [pos = "AJ0|AJC|AJS|AJ0-AV0|AJ0-NN1|AJ0-VVG"]+)
| ([pos = "AT0|CRD|DPS|DT0|ORD|CJT-DT0|CRD-PNI"]*
 [pos = "AV0|AJ0-AV0"]*
 [pos = "AJ0|AJC|AJS|AJ0-AV0|AJ0-NN1|AJ0-VVG"]*
 [pos = "NP0|NN1-NP0"]+)
| ([pos = "AJ0|AJC|AJS"]*
 [pos = "PNI|PNP|PNX|CRD-PNI"])
figure 3. A regular expression matching NPs 
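
For readers without access to CQP, the way such an expression treats a tag sequence as a regular language can be imitated with an ordinary regular expression over a space-separated string of tags. The Python sketch below does this for the first alternative only (determiners, adverbs, adjectives, then one common noun). The helper names tok and np_end are assumptions of the sketch, and the encoding as a tag string is a simplification, not how CQP works internally.

```python
import re

# Tag classes of the first alternative (determiner* adverb* adjective* noun).
DET = "AT0|CRD|DPS|DT0|ORD|CJT-DT0|CRD-PNI"
ADV = "AV0|AJ0-AV0"
ADJ = "AJ0|AJC|AJS|AJ0-AV0|AJ0-NN1|AJ0-VVG"
NOUN = "NN0|NN1|NN2|AJ0-NN1|NN1-NP0|NN1-VVB|NN1-VVG|NN2-VVZ"


def tok(tags):
    """Regex fragment for one token carrying one of `tags`, plus a separator."""
    return rf"(?:(?:{tags}) )"


NP_FIRST_ALT = re.compile(tok(DET) + "*" + tok(ADV) + "*" + tok(ADJ) + "*" + tok(NOUN))


def np_end(tags, start=0):
    """Return the index just past an NP beginning at `start`, or None."""
    text = "".join(tag + " " for tag in tags[start:])
    match = NP_FIRST_ALT.match(text)
    if match is None:
        return None
    # Each matched token contributes exactly one separator.
    return start + match.group(0).count(" ")


print(np_end(["AT0", "CRD", "NN2"]))         # the three kittens
print(np_end(["AT0", "AV0", "AJ0", "NN1"]))  # an unusually large hat
print(np_end(["VVD", "AT0", "NN1"], start=1))
```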
2.2 Coordinated passives 
As an example of a query matching a 
'movement' structure, consider the query for 
coordinated passives, given in figure 4 below.
The leftmost column gives the query
expression itself, while the other columns show
concordance lines found by this query. The
target lemma is the verb cure:

[Figure 4: the query expression and part of the concordance display are illegible in the source. Legible concordance fragments, by column: been / be / be / be; prevented / managed / treated; but not / or largely / and often / for it and; cured / cured / cured / cured.]

figure 4. A query for passives in coordinate structures

1 Inevitably, this strategy fails in some cases, such as
"I'm reading Henry James now" (vs. "I read Henry
stories").

3 The macroprocessor
A macroprocessor has been developed 2
that allows the user to specify in a simple
input file which subcorpora are to be
created for a given lemma.
The macroprocessor writes the number
of matches for each subcategorization
pattern into an output file. A sample input
file for the lemma insist is given in figure 5
below.
lemma: insist
CQP Search Definition
search_by: lemma
POS: verb
np: (y/n) n
np_np: (y/n) n
np_ap: (y/n) n
np_pp: (_list_ prepositions) none
np_ping: (_list_ prepositions) none
np_pwh: (_list_ prepositions) none
np_vpto: (y/n) n
np_vping: (y/n) n
np_sfin: (y/n) n
np_wh: (y/n) n
np_sbrst: (y/n) n
save_text: no
save_binary: yes
pp: (_list_ prepositions) on
ping: (_list_ prepositions) on
pwh: (_list_ prepositions) on
particle: (y/n) n
np_particle: (y/n) n
particle_pp: (y/n) n
particle_wh: (y/n) n
ap: (y/n) n
directquote: (y/n) y
sfin: (y/n) y
sbrst: (y/n) y
figure 5: Input form for macroprocessor
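
An input form of this kind is simple to read programmatically. The Python sketch below shows one way of turning it into a dictionary of settings and a list of requested frames. It is not the macroprocessor described here, whose implementation is not documented in this paper; the function names parse_input_form and requested_frames, and the treatment of "none" as an empty list, are assumptions of the sketch.

```python
import re

# key: (optional hint) value
LINE = re.compile(r"^(?P<key>[\w.]+):\s*(?:\((?P<hint>[^)]*)\)\s*)?(?P<value>.*)$")


def parse_input_form(text):
    """Parse an input form like the one in figure 5 into a dict."""
    settings = {}
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match is None:
            continue  # titles and blank lines carry no setting
        key = match.group("key")
        hint = match.group("hint")
        value = match.group("value").strip()
        if hint == "y/n":
            settings[key] = value.lower() == "y"
        elif hint is not None and "prepositions" in hint:
            settings[key] = [] if value == "none" else value.split()
        else:
            settings[key] = value
    return settings


def requested_frames(settings):
    """Names of frames switched on or given a non-empty preposition list."""
    return [key for key, value in settings.items()
            if value is True or (isinstance(value, list) and value)]


SAMPLE = """lemma: insist
CQP Search Definition
search_by: lemma
POS: verb
np: (y/n) n
pp: (_list_ prepositions) on
np_pp: (_list_ prepositions) none
sfin: (y/n) y
save_binary: yes
"""

settings = parse_input_form(SAMPLE)
print(requested_frames(settings))
```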
4 Output format 
The subcorpora can be saved as binary files
for further processing in CQP or XKWIC,
an interactive corpus query tool (Christ
1994b), and as text files. The text files are
sorted, usually by the head of the first
complement following the target lemma.
5 Limitations of the approach 
Our tool relies on subcategorization information
as its input. Hence it is not capable of
automatically learning subcategorization
frames, e.g. ones that are missing in dictionaries
or omitted in the input file. The tool
facilitates the (manual) discovery of evidence
for new subcategorization frames, however, as
potential complement patterns are saved in
separate subcorpora. Indeed, this is one of the
ways in which the tool is being used in the
context of the FrameNet project.
Some of the technical limitations of the existing
tools result from the fact that we are
working with an unparsed corpus. Thus, many
types of 'null' or 'empty' constituents 3 are not
recognized by the queries. Ambiguities in
prepositional phrase attachment are another
major source of errors. For instance, of the
concordance lines supposedly instantiating a
[_NP PPwith] frame for the verb heal, several
in fact contained embedded PPs (e.g. [_NP], as
in heal [children with asthma], rather than
[_NP PPwith], as in healing [arthritis] [with a
crystal ball]).
Finally, the search results can only be as accurate
as the part-of-speech tags and other annotations
in the corpus.
2 Our macroprocessor was developed by Collin Baker (ICSI-Berkeley) and Douglas Roland (U of Colorado, Boulder).
6 Future directions
Future versions of the tool will include 
searches for predicative (vs. attributive) uses 
for adjectives and nouns. For verbs, the 
searches will be expanded to cover the entire 
set of complementation patterns described in 
the COMLEX syntactic dictionary. 
Conclusion 
We have presented an overview of a set of tools 
for extracting corpus lines illustrating subcate- 
gorization patterns of nouns, verbs, and adjec- 
tives, and for determining the frequency of 
these patterns. The tools are currently used as 
part of the FrameNet project. An overview of 
the whole project can be found at: 
http://www.icsi.berkeley.edu/~framenet. 
Acknowledgements 
This work grew out of an extremely enjoyable 
collaborative effort with Dr. Ulrich Heid of 
IMS Stuttgart and Dan Jurafsky of the
University of Colorado, Boulder. I would like
to thank Doug Roland and especially the 
untiring Collin Baker for their work on the 
macroprocessor. I would also like to thank the 
members of the FrameNet project for their 
comments and suggestions. I thank Judith 
Eckle-Kohler of IMS-Stuttgart, JB Lowe of 
ICSI-Berkeley and Dan Jurafsky for com- 
ments on an earlier draft of this paper. 

References 
Baker, C. F., Fillmore, C. J. and Lowe, J. B.
(forthcoming). The Berkeley FrameNet project.
Proceedings of the 1998 ACL-COLING conference.
Christ, O. (1994a) The IMS Corpus Workbench
Technical Manual. Institut für maschinelle
Sprachverarbeitung, Universität Stuttgart.
Christ, O. (1994b) The XKwic User Manual. Institut
für maschinelle Sprachverarbeitung, Universität
Stuttgart.
Fillmore, C. J. (1982) Frame Semantics. In
"Linguistics in the morning calm", Hanshin
Publishing Co., Seoul, South Korea, 111-137.
Francis, W. N. and Kucera, H. (1982) Frequency
Analysis of English Usage: Lexicon and Grammar, 
Houghton Mifflin, Boston, MA. 
Hornby, A. S. (1989) Oxford Advanced Learner's 
Dictionary of Current English. 4th edition. Oxford 
University Press, Oxford, England. 
MacDonald, M. C. (ed.) (1997) Lexical Representa- 
tions and Sentence Processing. Erlbaum Taylor & 
Francis. 
Macleod, C. and Grishman, R. (1995) COMLEX 
Syntax Reference Manual. Linguistic Data 
Consortium, U. of Pennsylvania. 
Manning, Christopher D. (1993). Automatic Acquisi- 
tion of a large subcategorization dictionary from 
corpora. Proceedings of the 31st ACL, pp. 235- 
242. 
Procter, P. (ed.) (1978) Longman Dictionary of
Contemporary English. Longman, Burnt Mill, 
Harlow, Essex, England. 
Sinclair, J. M. (1987) Collins Cobuild English 
Language Dictionary. Collins, London, England. 
3 Our system is able to identify passive structures, 
tough-movement, and a number of common left 
isolation constructions, i.e. constructions involving 
'traces' or movement sites. 
