Automatic Extraction of Subcategorization Frames for Czech* 
Anoop Sarkar 
CIS Dept, Univ of Pennsylvania 
200 South 33rd Street, 
Philadelphia, PA 19104 USA 
anoop@linc.cis.upenn.edu
Daniel Zeman 
Ústav formální a aplikované lingvistiky
Univerzita Karlova 
Praha, Czechia 
zeman@ufal.mff.cuni.cz
Abstract 
We present some novel machine learning techniques
for the identification of subcategorization information
for verbs in Czech. We compare three different
statistical techniques applied to this problem. We
show how the learning algorithm can be used to
discover previously unknown subcategorization frames
from the Czech Prague Dependency Treebank. The
algorithm can then be used to label dependents of
a verb in the Czech treebank as either arguments
or adjuncts. Using our techniques, we are able to
achieve 88% precision on unseen parsed text.
1 Introduction 
The subcategorization of verbs is an essential issue
in parsing, because it helps disambiguate the
attachment of arguments and recover the correct
predicate-argument relations by a parser. (Carroll
and Minnen, 1998; Carroll and Rooth, 1998) give
several reasons why subcategorization information
is important for a natural language parser. Machine-readable
dictionaries are not comprehensive enough
to provide this lexical information (Manning, 1993;
Briscoe and Carroll, 1997). Furthermore, such
dictionaries are available only for very few languages.
We need some general method for the automatic
extraction of subcategorization information from text
corpora.
Several techniques and results have been reported 
on learning subcategorization frames (SFs) from 
text corpora (Webster and Marcus, 1989; Brent, 
1991; Brent, 1993; Brent, 1994; Ushioda et al., 
1993; Manning, 1993; Ersan and Charniak, 1996; 
Briscoe and Carroll, 1997; Carroll and Minnen, 
1998; Carroll and Rooth, 1998). All of this work 
" Tiffs work was done during the second author's visit to tl~e 
University of Pennsylvania. We would like to thank Prof. Ar- 
avind Joshi, l)avid Chiang, Mark l)ras and the anonymous re- 
viewers for their comments. The first at,thor's work is partially 
supported by NS F Grant S BR 8920230. Many tools used in this 
work are the resuhs of project No. VS96151 of the Ministry of 
Education of the Czech Republic. The data (PDT) is thanks 
to grant No. 405/96/K214 of the Grant Agency of the Czech 
Republic. Both grants were given to the Institute of Fornml 
and Applied linguistics, Faculty of Mathenmtics and Physics, 
Charles University, Prague. 
deals with English. In this paper we report on 
techniques that automatically extract SFs for Czech, 
which is a flee word-order , where verb 
complements have visible case marking.I 
Apart from the choice of target language, this
work also differs from previous work in other ways.
Unlike all other previous work in this area, we do
not assume that the set of SFs is known to us in
advance. Also in contrast, we work with syntactically
annotated data (the Prague Dependency Treebank,
PDT (Hajič, 1998)) where the subcategorization
information is not given; although this might be
considered a simpler problem as compared to using raw
text, we have discovered interesting problems that a
user of a raw or tagged corpus is unlikely to face.
We first give a detailed description of the task
of uncovering SFs and also point out those properties
of Czech that have to be taken into account
when searching for SFs. Then we discuss some
differences from the other research efforts. We then
present the three techniques that we use to learn SFs
from the input data.
In the input data, many observed dependents of
the verb are adjuncts. To treat this problem effectively,
we describe a novel addition to the hypothesis
testing technique that uses subsets of observed
frames to permit the learning algorithm to better
distinguish arguments from adjuncts.
Using our techniques, we are able to achieve 88%
precision in distinguishing arguments from adjuncts
on unseen parsed text.
2 Task Description
In this section we describe precisely the proposed 
task. We also describe the input training material 
and the output produced by our algorithms. 
2.1 Identifying subcategorization frames
In general, the problem of identifying subcategorization
frames is to distinguish between arguments
and adjuncts among the constituents modifying a
verb.
1 One of the anonymous reviewers pointed out that (Basili
and Vindigni, 1998) presents a corpus-driven acquisition of
subcategorization frames for Italian.
[Figure 2 here: a lattice of observed frames for the verb
absolvovat and their one-member-smaller subsets, e.g.
N4 R2(od) R2(do) {0}, N4 R2(od) {2}, N4 R2(do) {0},
N4 R6(v) R6(na) {1}, N4 R6(v) {1}, N4 R6(po) {1}, the
single-member frames R2(od), R2(do), R6(v), R6(na), R6(po)
(all {0}), N4 {2+1+1}, and the empty frame {0}.]
Figure 2: Computing the subsets of observed frames for the verb absolvovat. The counts for each frame are
given within braces {}. In this example, the frames N4 R2(od), N4 R6(v) and N4 R6(po) have been observed
with other verbs in the corpus. Note that the counts in this figure do not correspond to the real counts for the
verb absolvovat in the training corpus.
where c(.) are counts in the training data. Using
the values computed above:

p_1 = k_1 / n_1
p_2 = k_2 / n_2
p = (k_1 + k_2) / (n_1 + n_2)
Taking these probabilities to be binomially
distributed, the log likelihood statistic (Dunning, 1993)
is given by:

-2 log λ = 2 [log L(p_1, n_1, k_1) + log L(p_2, n_2, k_2)
              - log L(p, n_1, k_1) - log L(p, n_2, k_2)]

where

log L(p, n, k) = k log p + (n - k) log(1 - p)
According to this statistic, the greater the value of
-2 log λ for a particular pair of observed frame and
verb, the more likely that frame is to be a valid SF of
the verb.
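
As an illustration only (not code from the original system),
a minimal Python sketch of this statistic; log_L and
log_likelihood_ratio are our own hypothetical names, and the
counts k_1, n_1, k_2, n_2 are assumed to be precomputed from
the treebank as defined above:

import math

def log_L(p, n, k):
    # log L(p, n, k) = k log p + (n - k) log(1 - p),
    # handling the 0 * log(0) = 0 limit cases explicitly
    if p == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p == 1.0:
        return 0.0 if k == n else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def log_likelihood_ratio(k1, n1, k2, n2):
    # -2 log lambda (Dunning, 1993); larger values mean a
    # stronger association between the verb and the frame
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (log_L(p1, n1, k1) + log_L(p2, n2, k2)
                  - log_L(p, n1, k1) - log_L(p, n2, k2))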
3.2 T-scores 
Another statistic that has been used for hypothesis
testing is the t-score. Using the definitions from
Section 3.1 we can compute t-scores using the
equation below and use its value to measure the
association between a verb and a frame observed with it.

T = (p_1 - p_2) / sqrt(σ²(n_1, p_1) + σ²(n_2, p_2))

where σ²(n, p) = p(1 - p) / n.
In particular, the hypothesis being tested using
the t-score is whether the distributions p_1 and p_2
are not independent. If the value of T is greater
than some threshold then the verb v should take the
frame f as a SF.
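
A corresponding sketch, again our own illustration and
assuming the variance form reconstructed above:

import math

def t_score(k1, n1, k2, n2):
    # T = (p1 - p2) / sqrt(var(n1, p1) + var(n2, p2)),
    # with var(n, p) = p(1 - p) / n
    p1, p2 = k1 / n1, k2 / n2
    var = p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2
    return (p1 - p2) / math.sqrt(var) if var > 0.0 else 0.0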
3.3 Binomial Models of Miscue Probabilities 
Once again assuming that the data is binomially
distributed, we can look for frames that co-occur with a
verb by exploiting the miscue probability: the
probability of a frame co-occurring with a verb when it
is not a valid SF. This is the method used by several
earlier papers on SF extraction starting with (Brent,
1991; Brent, 1993; Brent, 1994).
Let us consider the probability p_f, which is the
probability that a given verb is observed with a frame but
this frame is not a valid SF for this verb. p_f is the
error probability on identifying a SF for a verb. Let
us consider a verb v which does not have as one of
its valid SFs the frame f. How likely is it that v will
be seen m or more times in the training data with
frame f? If v has been seen a total of n times in the
data, then H*(p_f; m, n) gives us this likelihood:

H*(p_f; m, n) = Σ_{i=m}^{n} C(n, i) p_f^i (1 - p_f)^{n-i}

where C(n, i) is the binomial coefficient.
If H*(p_f; m, n) is less than or equal to some small
threshold value then it is extremely unlikely that the
hypothesis is true, and hence the frame f must be
a SF of the verb v. Setting the threshold value to
0.05 gives us a 95% or better confidence value that
the verb v has been observed often enough with a
frame f for it to be a valid SF.
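
A sketch of this test (our own illustration; math.comb
requires Python 3.8+):

import math

def miscue_tail(p_f, m, n):
    # H*(p_f; m, n): the chance of seeing the frame m or more
    # times in n occurrences of the verb if it is not a valid SF
    return sum(math.comb(n, i) * p_f**i * (1.0 - p_f)**(n - i)
               for i in range(m, n + 1))

def accept_as_sf(p_f, m, n, threshold=0.05):
    # reject the miscue hypothesis at 95% or better confidence:
    # the frame is then taken to be a valid SF of the verb
    return miscue_tail(p_f, m, n) <= threshold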
Initially, we consider only the observed frames
(OFs) from the treebank. There is a chance that
some are subsets of some others, but for now we count
only the cases when the OFs were seen themselves.
Let us assume the test statistic rejected the frame.
Then it is not a real SF, but there probably is a
subset of it that is a real SF. So we select exactly one of
the subsets whose length is one member less: this
is the successor of the rejected frame and inherits
its frequency. Of course one frame may be the
successor of several longer frames and it can have its
own count as an OF. This is how frequencies accumulate
and frames become more likely to survive. The
example shown in Figure 2 illustrates how the subsets
and successors are selected.
An important point is the selection of the successor.
We have to select only one of the n possible
successors of a frame of length n, otherwise we
would break the total frequency of the verb. Suppose
there are m rejected frames of length n. This
yields m × n possible modifications to consider
before selecting the successor. We implemented
two methods for choosing a single successor frame:
1. Choose the one that results in the strongest
preference for some frame (that is, the successor
frame results in the lowest entropy across
the corpus). This measure is sensitive to the
frequency of this frame in the rest of the corpus.

2. Random selection of the successor frame from
the alternatives.
Random selection resulted in better precision
(88% instead of 86%). It is not clear why a method
that is sensitive to the frequency of each proposed
successor frame does not perform better than random
selection. A sketch of the successor step is given below.
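
In this rough sketch (our own simplification, not the original
implementation), a frame is a tuple of members, counts maps
frames to accumulated frequencies, and the entropy criterion
of method 1 is approximated by preferring the candidate
already most frequent elsewhere in the corpus:

import random

def successors(frame):
    # the n frames obtained by dropping exactly one member
    # of an n-member frame
    return [frame[:i] + frame[i + 1:] for i in range(len(frame))]

def inherit(frame, counts, method="random"):
    # a rejected (non-empty) frame passes its accumulated count
    # to exactly one successor, preserving the verb's total frequency
    cands = successors(frame)
    if method == "random":
        succ = random.choice(cands)
    else:
        # crude stand-in for the lowest-entropy criterion
        succ = max(cands, key=lambda f: counts.get(f, 0))
    counts[succ] = counts.get(succ, 0) + counts.pop(frame, 0)
    return succ

The random branch mirrors method 2 above; the max-by-count
branch is only an approximation of the entropy measure of
method 1.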
The technique described here may sometimes
result in a subset of a correct SF, discarding one or more
of its members. Such frames can still help parsers
because they can at least look for the dependents that
have survived.
4 Evaluation 
For the evaluation of the methods described above
we used the Prague Dependency Treebank (PDT).
We used 19,126 sentences of training data from the
PDT (about 300K words). In this training set, there
were 33,641 verb tokens with 2,993 verb types.
There were a total of 28,765 observed frames (see
Section 2.1 for an explanation of these terms). There
were 914 verb types seen 5 or more times.
Since there is no electronic valence dictionary for
Czech, we evaluated our filtering technique on a set
of 500 test sentences which were unseen and separate
from the training data. These test sentences
were used as a gold standard by distinguishing the
arguments and adjuncts manually. We then compared
the accuracy of our output set of items marked
as either arguments or adjuncts against this gold
standard.
First we describe the baseline methods. Baseline
method 1: consider each dependent of a verb
an adjunct. Baseline method 2: use just the longest
known observed frame matching the test pattern. If
no matching OF is known, find the longest partial
match in the OFs seen in the training data. We
exploit the functional and morphological tags while
matching. No statistical filtering is applied in either
baseline method. A sketch of the second baseline appears below.
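
A rough sketch of our reading of baseline method 2 (the names
and data structures are our own: deps is the set of observed
dependents of a test verb, known_ofs the OFs seen in training):

def baseline2(deps, known_ofs):
    # prefer the longest known OF wholly contained in the observed
    # dependents; otherwise fall back to the known OF with the
    # largest partial overlap
    contained = [of for of in known_ofs if set(of) <= set(deps)]
    if contained:
        return max(contained, key=len)
    return max(known_ofs, key=lambda of: len(set(of) & set(deps)),
               default=())

Members of the returned frame are labeled arguments; the
remaining dependents are labeled adjuncts.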
A comparison between all three methods that
were proposed in this paper is shown in Table 1.
The experiments showed that the method improved
precision of this distinction from 57% to
88%. We were able to classify as many as 914 verbs,
which is a number outperformed only by Manning,
with 10x more data (note that our results are for a
different language).
Also, our method discovered 137 subcategorization
frames from the data. The known upper bound
of frames that the algorithm could have found (the
total number of the observed frame types) was 450.
5 Comparison with related work 
Preliminary work on SF extraction from corpora
was done by (Brent, 1991; Brent, 1993; Brent,
1994) and (Webster and Marcus, 1989; Ushioda et
al., 1993). Brent (Brent, 1993; Brent, 1994) uses the
standard method of testing miscue probabilities for
filtering frames observed with a verb. (Brent, 1994)
presents a method for estimating p_f. Brent applied
his method to a small number of verbs and associated
SF types. (Manning, 1993) applies Brent's
method to parsed data and obtains a subcategorization
dictionary for a larger set of verbs. (Briscoe
and Carroll, 1997; Carroll and Minnen, 1998) differs
from earlier work in that a substantially larger
set of SF types are considered; (Carroll and Rooth,
1998) use an EM algorithm to learn subcategorization
as a result of learning rule probabilities, and, in
turn, to improve parsing accuracy by applying the
verb SFs obtained. (Basili and Vindigni, 1998) use
a conceptual clustering algorithm for acquiring
subcategorization frames for Italian. They establish a
partial order on partially overlapping OFs (similar
to our OF subsets) which is then used to suggest a
potential SF. A complete comparison of all the
previous approaches with the current work is given in
Table 2.
While these approaches differ in size and quality
of training data, number of SF types (e.g.
intransitive verbs, transitive verbs) and number of verbs
processed, there are properties that all have in common.
They all assume that they know the set of possible
SF types in advance. Their task can be viewed
as assigning one or more of the (known) SF types
to a given verb. In addition, except for (Briscoe and
Carroll, 1997; Carroll and Minnen, 1998), only a
small number of SF types is considered.
                            Baseline 1  Baseline 2  Lik. Ratio  T-scores  Hyp. Testing
Precision                          55%         78%         82%       82%           88%
Recall                             55%         73%         77%       77%           74%
F(β=1)                             55%         75%         79%       79%           80%
% unknown                           0%          6%          6%        6%           16%
Total verb nodes                  1027        1027        1027      1027          1027
Total complements                 2144        2144        2144      2144          2144
Nodes with known verbs            1027         981         981       981           907
Complements of known verbs        2144        2010        2010      2010          1812
Correct suggestions             1187.5      1573.5      1642.5    1652.9        1596.5
True arguments                   956.5       910.5       910.5     910.5         834.5
Suggested arguments                  0        1122         974      1026           674
Incorrect arg suggestions            0         324       215.5     236.3          27.5
Incorrect adj suggestions        956.5       112.5         152     120.8           188
Table 1: Comparison between the baseline methods and the three methods proposed in this paper. Some of
the values are not integers since for some difficult cases in the test data, the value for each argument/adjunct
decision was set to a value between 0 and 1. Recall is computed as the number of correct suggestions
divided by the total number of complements. Precision is computed as the number of correct suggestions
divided by the number of known verb complements. F(β=1) = (2 × p × r)/(p + r). % unknown represents
the percent of test data not considered by a particular method.
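
As a quick arithmetic check (ours) of the Hyp. Testing column
against the caption's definitions, in Python:

correct, known, total = 1596.5, 1812, 2144  # Hyp. Testing column
p = correct / known         # precision: 0.881 -> 88%
r = correct / total         # recall:    0.745 -> 74%
f = 2 * p * r / (p + r)     # F(β=1):    0.807 -> 80%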
Using a dependency treebank as input to our
learning algorithm has both advantages and drawbacks.
There are two main advantages of using a
treebank:
• Access to more accurate data. Data is less
noisy when compared with tagged or parsed
input data. We can expect correct identification
of verbs and their dependents.

• We can explore techniques (as we have done in
this paper) that try to learn the set of SFs from
the data itself, unlike other approaches where
the set of SFs has to be set in advance.
Also, by using a treebank we can use verbs in different
contexts which are problematic for previous
approaches, e.g. we can use verbs that appear in
relative clauses. However, there are two main drawbacks:

• Treebanks are expensive to build and so the
techniques presented here have to work with
less data.

• All the dependents of each verb are visible to
the learning algorithm. This is contrasted with
previous techniques that rely on finite-state
extraction rules which ignore many dependents
of the verb. Thus our technique has to deal
with a different kind of data as compared to
previous approaches.
We tackle the second problem by using the
method of observed frame subsets described in
Section 3.3.
6 Conclusion 
We are currently incorporating the SF information
produced by the methods described in this paper
into a parser for Czech. We hope to duplicate the
increase in performance shown by treebank-based
parsers for English when they use SF information.
Our methods can also be applied to improve the
annotations in the original treebank that we use as
training data. The automatic addition of subcategorization
to the treebank can be exploited to add
predicate-argument information to the treebank.
Also, techniques for extracting SF information
from data can be used along with other research
which aims to discover relationships between
different SFs of a verb (Stevenson and Merlo, 1999;
Lapata and Brew, 1999; Lapata, 1999; Stevenson et
al., 1999).
The statistical models in this paper were based on
the assumption that given a verb, different SFs occur
independently. This assumption is used to justify
the use of the binomial. Future work perhaps
should look towards removing this assumption by
modeling the dependence between different SFs for
the same verb using a multinomial distribution.
To summarize: we have presented techniques that
can be used to learn subcategorization information
for verbs. We exploit a dependency treebank to
learn this information, and moreover we discover
the final set of valid subcategorization frames from
the training data. We achieve up to 88% precision on
unseen data.
We have also tried our methods on data which
was automatically morphologically tagged, which
allowed us to use more data (82K sentences instead
of 19K). The performance went up to 89% (a 1%
improvement).
Previous work                Data              #SFs           #verbs  Method                  Miscue rate            Corpus
(Ushioda et al., 1993)       POS + FS rules    6                  33  heuristics              NA                     WSJ (300K)
(Brent, 1993)                raw + FS rules    6                 193  Hypothesis testing      iterative estimation   Brown (1.1M)
(Manning, 1993)              POS + FS rules    19               3104  Hypothesis testing      hand                   NYT (4.1M)
(Brent, 1994)                raw + heuristics  12                126  Hypothesis testing      non-iter. estimation   CHILDES (32K)
(Ersan and Charniak, 1996)   Full parsing      16                 30  Hypothesis testing      hand                   WSJ (36M)
(Briscoe and Carroll, 1997)  Full parsing      160                14  Hypothesis testing      Dictionary estimation  various (70K)
(Carroll and Rooth, 1998)    Unlabeled         9+                  3  Inside-outside          NA                     BNC (5-30M)
Current Work                 Fully Parsed      137 (learned)     914  Subsets + Hyp. testing  Estimate               PDT (300K)

Table 2: Comparison with previous work on automatic SF extraction from corpora

References 

Roberto Basili and Michele Vindigni. 1998. Adapting a subcategorization
lexicon to a domain. In Proceedings of
the ECML'98 Workshop TANLPS: Towards adaptive NLP-driven
systems: linguistic information, learning methods
and applications, Chemnitz, Germany, April 24.

Peter Bickel and Kjell Doksum. 1977. Mathematical Statistics.
Holden-Day Inc.

Michael Brent. 1991. Automatic acquisition of subcategorization
frames from untagged text. In Proceedings of the 29th
Meeting of the ACL, pages 209-214, Berkeley, CA.

Michael Brent. 1993. From grammar to lexicon: unsupervised
learning of lexical syntax. Computational Linguistics,
19(3):243-262.

Michael Brent. 1994. Acquisition of subcategorization frames
using aggregated evidence from local syntactic cues. Lingua,
92:433-470. Reprinted in Acquisition of the Lexicon,
L. Gleitman and B. Landau (Eds.), MIT Press, Cambridge,
MA.

Ted Briscoe and John Carroll. 1997. Automatic extraction of
subcategorization from corpora. In Proceedings of the 5th
ANLP Conference, pages 356-363, Washington, D.C. ACL.

John Carroll and Guido Minnen. 1998. Can subcategorisation
probabilities help a statistical parser? In Proceedings
of the 6th ACL/SIGDAT Workshop on Very Large Corpora
(WVLC-6), Montreal, Canada.

Glenn Carroll and Mats Rooth. 1998. Valence induction with
a head-lexicalized PCFG. In Proceedings of the 3rd Conference
on Empirical Methods in Natural Language Processing
(EMNLP 3), Granada, Spain.

Ted Dunning. 1993. Accurate methods for the statistics
of surprise and coincidence. Computational Linguistics,
19(1):61-74, March.

Murat Ersan and Eugene Charniak. 1996. A statistical syntactic
disambiguation program and what it learns. In
S. Wermter, E. Riloff, and G. Scheler, editors, Connectionist,
Statistical and Symbolic Approaches to Learning
for Natural Language Processing, volume 1040 of Lecture
Notes in Artificial Intelligence, pages 146-159. Springer-Verlag,
Berlin.

Jan Hajič and Barbora Hladká. 1998. Tagging inflective languages:
Prediction of morphological categories for a rich,
structured tagset. In Proceedings of COLING-ACL 98,
Université de Montréal, Montreal, pages 483-490.

Jan Hajič. 1998. Building a syntactically annotated corpus:
The Prague Dependency Treebank. In Issues of Valency and
Meaning, pages 106-132. Karolinum, Praha.

Maria Lapata and Chris Brew. 1999. Using subcategorization
to resolve verb class ambiguity. In Pascale Fung and Joe
Zhou, editors, Proceedings of WVLC/EMNLP, pages 266-274,
21-22 June.

Maria Lapata. 1999. Acquiring lexical generalizations from
corpora: A case study for diathesis alternations. In Proceedings
of the 37th Meeting of the ACL, pages 397-404.

Christopher D. Manning. 1993. Automatic acquisition of a
large subcategorization dictionary from corpora. In Proceedings
of the 31st Meeting of the ACL, pages 235-242,
Columbus, Ohio.

Suzanne Stevenson and Paola Merlo. 1999. Automatic verb
classification using distributions of grammatical features. In
Proceedings of EACL '99, pages 45-52, Bergen, Norway,
8-12 June.

Suzanne Stevenson, Paola Merlo, Natalia Kariaeva, and Kamin
Whitehouse. 1999. Supervised learning of lexical semantic
classes using frequency distributions. In SIGLEX-99.

Akira Ushioda, David A. Evans, Ted Gibson, and Alex Waibel.
1993. The automatic acquisition of frequencies of verb subcategorization
frames from tagged corpora. In B. Boguraev
and J. Pustejovsky, editors, Proceedings of the Workshop on
Acquisition of Lexical Knowledge from Text, pages 95-106,
Columbus, OH, 21 June.

Mort Webster and Mitchell Marcus. 1989. Automatic acquisition
of the lexical frames of verbs from sentence frames. In
Proceedings of the 27th Meeting of the ACL, pages 177-184.
