Learning Probabilistic Subcategorization Preference 
by Identifying Case Dependencies and 
Optimal Noun Class Generalization Level* 
Takehito Utsuro Yuji Matsumoto 
Graduate School of Information Science, Nara Institute of Science and Technology 
8916-5, Takayama-cho, Ikoma-shi, Nara, 630-01, JAPAN 
{utsuro,matsu}@is.aist-nara.ac.jp
Abstract 
This paper proposes a novel method of learning 
probabilistic subcategorization preference. In the 
method, for the purpose of coping with the ambi- 
guities of case dependencies and noun class gen- 
eralization of argument/adjunct nouns, we intro- 
duce a data structure which represents a tuple 
of independent partial subcategorization frames. 
Each collocation of a verb and argument/adjunct 
nouns is assumed to be generated from one of the 
possible tuples of independent partial subcatego- 
rization frames. Parameters of subcategorization 
preference are then estimated so as to maximize 
the subcategorization preference function for each 
collocation of a verb and argument/adjunct nouns 
in the training corpus. We also describe the results 
of the experiments on learning probabilistic sub- 
categorization preference from the EDR Japanese 
bracketed corpus, as well as those on evaluating 
the performance of subcategorization preference. 
1 Introduction 
In corpus-based NLP, extraction of linguistic knowl- 
edge such as lexical/semantic collocation is one of the 
most important issues and has been intensively studied
in recent years. In this line of research, extracted
lexical/semantic collocation is especially useful in terms
of ranking parses in syntactic analysis as well as au- 
tomatic construction of lexicon for NLP. 
For example, in the context of syntactic disambiguation,
Black (1993) and Magerman (1995) proposed
statistical parsing models based on decision-tree
learning techniques, which incorporated not
only syntactic but also lexical/semantic information
in the decision trees. As lexical/semantic information,
Black (1993) used about 50 semantic categories,
while Magerman (1995) used lexical forms of words.
Collins (1996) proposed a statistical parser which
is based on probabilities of dependencies between
head-words in the parse tree. In those works,
lexical/semantic collocations are used for ranking parses
in syntactic analysis.
*The authors would like to thank Dr. Hang Li of NEC
C&C Research Laboratories, Dr. Kentaro Inui of Tokyo
Institute of Technology, Dr. Koiti Hasida of Electrotechnical
Laboratory, Dr. Takashi Miyata of Nara Institute of
Science and Technology, and also anonymous reviewers of
ANLP97 for valuable comments on this work.
On the other hand, in the context of automatic lex- 
icon construction, the emphasis is mainly on the ex- 
traction of lexical/semantic collocational knowledge of 
specific words rather than its use in sentence parsing. 
For example, Haruno (1995) applied an information- 
theoretic data compression technique to corpus-based 
case frame learning, and proposed a method of find- 
ing case frames of verbs as compressed representation 
of verb-noun collocational data in corpus. The work 
concentrated on the extraction of declarative repre- 
sentation of case frames and did not consider their 
performance in sentence parsing. 
This paper focuses on extracting lexical/semantic 
collocational knowledge of verbs for the purpose of ap- 
plying it to ranking parses in syntactic analysis. More 
specifically, we propose a novel method for learning 
parameters for calculating subcategorization prefer- 
ence functions of verbs. In general, when learning lex- 
ical/semantic collocational knowledge of verbs from 
corpus, it is necessary to cope with the following two 
types of ambiguities: 
1) The ambiguity of case dependencies 
2) The ambiguity of noun class generalization 
1) is caused by the fact that, only by observing each 
verb-noun collocation in corpus, it is not decidable 
which cases are dependent on each other and which 
cases are optional and independent of other cases. 2) 
is caused by the fact that, only by observing each verb- 
noun collocation in corpus, it is not decidable which 
superordinate class generates each observed leaf class 
in the verb-noun collocation. 
So far, several studies have addressed
these two issues in learning collocational knowledge
of verbs and have also evaluated the results in
terms of syntactic disambiguation. Resnik (1993)
and Li and Abe (1995) studied how to find an opti- 
mal abstraction level of an argument noun in a tree- 
structured thesaurus. Although they evaluated the 
obtained abstraction level of the argument noun by its 
performance in syntactic disambiguation, their works 
are limited to only one argument. Li and Abe (1996) 
also studied a method for learning dependencies be- 
tween case slots and evaluated the discovered depen- 
dencies in the syntactic disambiguation task. They 
first obtained optimal abstraction levels of the argu- 
ment nouns by the method in Li and Abe (1995), and 
then tried to discover dependencies between the class- 
based case slots. They reported that dependencies 
were discovered only at the slot-level and not at the 
class-level. 
Compared with those previous works, this paper 
proposes to cope with the above two ambiguities in 
a uniform way. First, we introduce a data structure 
which represents a tuple of independent partial sub- 
categorization frames. Each collocation of a verb and 
argument/adjunct nouns is assumed to be generated 
from one of the possible tuples of independent partial
subcategorization frames. Then, parameters of
subcategorization preference are estimated so as to 
maximize the subcategorization preference function 
for each collocation of a verb and argument/adjunct 
nouns in the training corpus. We describe the results 
of the experiments on learning probabilistic subcate- 
gorization preference from the EDR Japanese brack- 
eted corpus (EDR, 1995), as well as those on evaluat- 
ing the performance of subcategorization preference. 
2 Data Structure 
2.1 Verb-Noun Collocation 
Verb-noun collocation is a data structure for the collo- 
cation of a verb and all of its argument/adjunct nouns. 
A verb-noun collocation e is represented by a feature 
structure which consists of the verb v and all the pairs 
of co-occurring case-markers p and thesaurus classes 
c of case-marked nouns: 1 
$$ e = \begin{bmatrix} pred : v \\ p_1 : c_1 \\ \vdots \\ p_k : c_k \end{bmatrix} \qquad (1) $$
We assume that a thesaurus is a tree-structured type 
hierarchy in which each node represents a semantic 
class, and each thesaurus class $c_1, \ldots, c_k$ in a verb-noun
collocation is a leaf class. We also introduce $\preceq_c$
as the superordinate-subordinate relation of classes in
a thesaurus: $c_1 \preceq_c c_2$ means that $c_1$ is subordinate to
$c_2$.
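As an illustration, the tree-structured thesaurus and the relation $\preceq_c$ can be sketched in Python as below. This is only a minimal sketch: the class names and the `PARENT` table are hypothetical placeholders rather than entries of an actual thesaurus, and we assume here that $\preceq_c$ is reflexive (a class counts as subordinate to itself).

```python
# Toy tree-structured thesaurus: each class maps to its parent
# (None for the root). All names are illustrative placeholders.
PARENT = {
    "root": None,
    "animal": "root",
    "human": "animal",
    "child": "human",
    "liquid": "root",
    "beverage": "liquid",
    "juice": "beverage",
}


def ancestors(c):
    """Return c followed by all of its superordinate classes, leaf to root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def is_subordinate(c1, c2):
    """True iff c1 is subordinate to c2 (assumed reflexive)."""
    return c2 in ancestors(c1)
```

With this toy table, `is_subordinate("child", "human")` holds while `is_subordinate("human", "child")` does not.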
2.2 Subcategorization Frame 
A subcategorization frame f is represented by a feature 
structure which consists of a verb v and the pairs of 
case-markers p and sense restriction c of case-marked 
argument/adjunct nouns: 
$$ f = \begin{bmatrix} pred : v \\ p_1 : c_1 \\ \vdots \\ p_l : c_l \end{bmatrix} $$
Sense restrictions $c_1, \ldots, c_l$ of case-marked
argument/adjunct nouns are represented by classes at
arbitrary levels of the thesaurus. A subcategorization
frame f can be divided into two parts: one is the 
verbal part fv containing the verb v while the other 
is the nominal part fp containing all the pairs of 
1 Although we ignore sense ambiguities of case-marked
nouns in this definition, in section 5.2, we briefly mention
how we deal with sense ambiguities of case-marked nouns
in the current implementation.
case-markers p and sense restriction c of case-marked 
nouns. 
$$ f = f_v \wedge f_p = \begin{bmatrix} pred : v \end{bmatrix} \wedge \begin{bmatrix} p_1 : c_1 \\ \vdots \\ p_l : c_l \end{bmatrix} $$
2.3 Subsumption Relation 
We introduce the subsumption relation $\preceq_f$ between a verb-noun
collocation $e$ and a subcategorization frame $f$:
$e \preceq_f f$ iff for each case-marker $p_i$ in $f$ and
its noun class $c_{if}$, there exists the
same case-marker $p_i$ in $e$ and its
noun class $c_{ie}$ is subordinate to
$c_{if}$, i.e. $c_{ie} \preceq_c c_{if}$.
The subsumption relation $\preceq_f$ is applicable also as a
subsumption relation of two subcategorization frames.
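The subsumption check can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `(verb, cases)` representation, the toy `PARENT` table, and the requirement that the verbs match are all assumptions made here for the example.

```python
# Toy thesaurus (hypothetical): class -> parent, None for the root.
PARENT = {
    "root": None, "animal": "root", "human": "animal", "child": "human",
    "liquid": "root", "beverage": "liquid", "juice": "beverage",
    "place": "root", "park": "place",
}


def is_subordinate(c1, c2):
    """True iff class c1 is subordinate to class c2 (assumed reflexive)."""
    while c1 is not None:
        if c1 == c2:
            return True
        c1 = PARENT[c1]
    return False


def subsumes(e, f):
    """True iff e is subsumed by f: every case-marker of f occurs in e
    with a class subordinate to the class required by f.
    e and f are (verb, {case_marker: class}) pairs; requiring equal
    verbs is an assumption of this sketch."""
    verb_e, cases_e = e
    verb_f, cases_f = f
    if verb_e != verb_f:
        return False
    return all(
        p in cases_e and is_subordinate(cases_e[p], c)
        for p, c in cases_f.items()
    )


e = ("nomu", {"ga": "child", "wo": "juice", "de": "park"})
f = ("nomu", {"ga": "human", "wo": "beverage"})
```

Here `subsumes(e, f)` is true: `f` mentions only "ga" and "wo", and both classes in `e` lie below the classes required by `f`.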
3 A Model of Generating Verb-Noun 
Collocation 
In this section, we introduce a model of generat- 
ing a verb-noun collocation from subcategorization 
frame(s). In order to cope with the ambiguities of 
case dependencies and noun class generalization in 
this model, we introduce a data structure which repre- 
sents a tuple of independent partial subcategorization 
frames. 
3.1 Generating a Verb-Noun Collocation 
from Independent Partial 
Subcategorization Frames 
First, we describe the idea of generating a verb-noun 
collocation from a subcategorization frame, or a tuple 
of partial subcategorization frames. 
Generation from a Subcategorization Frame 
Suppose a verb-noun collocation e is given as: 
$$ e = \begin{bmatrix} pred : v \\ p_1 : c_{1e} \\ \vdots \\ p_k : c_{ke} \end{bmatrix} $$
Then, let us consider a subcategorization frame f 
which can generate e. We assume that f has exactly 
the same case-markers as $e$ has,$^2$ and each semantic
class $c_{if}$ of a case-marked noun of $f$ is superordinate
to the corresponding leaf semantic class $c_{ie}$ of $e$:
$$ f = \begin{bmatrix} pred : v \\ p_1 : c_{1f} \\ \vdots \\ p_k : c_{kf} \end{bmatrix}, \quad c_{ie} \preceq_c c_{if} \ (i = 1, \ldots, k) \qquad (2) $$
Then, we denote the generation of the verb-noun
collocation $e$ from the subcategorization frame $f$ as:
$$ f \rightarrow e $$
Next, we describe the idea of generating a verb-noun 
collocation from a tuple of partial subcategorization 
frames which are independent of each other. 
2Since we do not consider ellipsis of argument nouns 
when generating a verb-noun collocation from a subcate- 
gorization frame, the subcategorization frame f is required 
to have exactly the same case-markers as e. 
Partial Subcategorization Frame 
First, we define a partial subcategorization frame fi 
of f as a subcategorization frame which has the same 
verb v as f as well as some of the case-markers of f and 
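their semantic classes. Then, we can find a division
of $f$ into a tuple $\langle f_1, \ldots, f_n \rangle$ of partial subcategorization
frames of $f$, where any pair $f_i$ and $f_{i'}$ ($i \neq i'$)
do not have common case-markers and the unification
$f_1 \wedge \cdots \wedge f_n$ of all the partial subcategorization frames
equals $f$:
$$ f = f_1 \wedge \cdots \wedge f_n \qquad (3) $$
$$ f_i = \begin{bmatrix} pred : v \\ \vdots \\ p_{ij} : c_{ij} \\ \vdots \end{bmatrix}, \quad \forall j \forall j' \ p_{ij} \neq p_{i'j'} \quad (i, i' = 1, \ldots, n, \ i \neq i') \qquad (4) $$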
Independence of Partial Subcategorization 
Frames 
We allow the division of $f$ into a tuple $\langle f_1, \ldots, f_n \rangle$
of partial subcategorization frames as in the equation
(3) only when the partial subcategorization frames $f_1, \ldots, f_n$
can be regarded as events occurring independently
of each other. With some corpus, usually we
can estimate the conditional probabilities $p(f \mid v)$ and
$p(f_i \mid v)$ of the (partial) subcategorization frames $f$
and $f_i$ ($i = 1, \ldots, n$) given the verb $v$. According
to the estimated probabilities, we can judge whether
$f_1, \ldots, f_n$ are independent of each other as follows.
First, we estimate the conditional probability $p(f \mid v)$
of a (partial) subcategorization frame $f$ by summing
up the conditional probabilities $p(e \mid v)$ of all
the verb-noun collocations $e$ given the verb $v$, where
$e$ is subsumed by $f$ ($e \preceq_f f$):$^3$
$$ p(f \mid v) \approx \sum_{e \preceq_f f} p(e \mid v) \qquad (5) $$
The conditional joint probability $p(f_1, \ldots, f_n \mid v)$ is
also estimated by summing up $p(e \mid v)$ where $e$ is
subsumed by all of $f_1, \ldots, f_n$ ($e \preceq_f f_1, \ldots, f_n$):
$$ p(f_1, \ldots, f_n \mid v) \approx \sum_{e \preceq_f f_1, \ldots, f_n} p(e \mid v) \qquad (6) $$
Then, we give a formal definition of independence 
of partial subcategorization frames according to the 
estimated conditional probabilities: 
partial subcategorization frames $f_1, \ldots, f_n$ are independent
if any pair $f_i$ and $f_j$ ($i \neq j$) do not
have common case-markers, and for every subset
$f_{i_1}, \ldots, f_{i_j}$ of $j$ of these partial subcategorization
frames ($j = 2, \ldots, n$), the following equation
holds:
$$ p(f_{i_1}, \ldots, f_{i_j} \mid v) = p(f_{i_1} \mid v) \cdots p(f_{i_j} \mid v) \qquad (7) $$
Since it is too strict to judge the independence of 
partial subcategorization frames by the equation (7), 
3 The probability $p(e \mid v)$ can be estimated as
$freq(e)/freq(v)$ by M.L.E. (maximum likelihood estimation)
directly from the training corpus.
we relax the constraint of independence using a relaxation
parameter $\alpha$ ($0 < \alpha < 1$). Partial subcategorization
frames $f_1, \ldots, f_n$ are judged as independent
if, for every subset $f_{i_1}, \ldots, f_{i_j}$ of $j$ of these partial
subcategorization frames ($j = 2, \ldots, n$), the following
inequalities hold:
$$ \alpha \le \frac{p(f_{i_1}, \ldots, f_{i_j} \mid v)}{p(f_{i_1} \mid v) \cdots p(f_{i_j} \mid v)} \le \frac{1}{\alpha} \qquad (8) $$
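The estimates (5) and (6) and the relaxed independence test can be sketched as below. This is a minimal sketch under stated assumptions: the bounds are taken to be $\alpha$ and $1/\alpha$, frames are reduced to their nominal parts for a single fixed verb, the subsumption test is simplified to exact class match (no thesaurus), and the counts in `DATA` are invented for illustration.

```python
from itertools import combinations
from math import prod


def subsumes(e, f):
    """Simplified subsumption: every case of f appears in e with the
    same class. (A fuller version would also accept subordinate classes.)"""
    return all(e.get(case) == cls for case, cls in f.items())


def p_joint(frames, collocations):
    """Relative frequency of collocations subsumed by all given frames,
    i.e. the M.L.E. counterpart of (5) for one frame and (6) for several."""
    total = sum(count for _, count in collocations)
    hits = sum(count for e, count in collocations
               if all(subsumes(e, f) for f in frames))
    return hits / total


def independent(frames, collocations, alpha):
    """Relaxed independence test with assumed bounds [alpha, 1/alpha]."""
    # Partial frames must not share case-markers.
    if any(set(f) & set(g) for f, g in combinations(frames, 2)):
        return False
    for size in range(2, len(frames) + 1):
        for subset in combinations(frames, size):
            denom = prod(p_joint([f], collocations) for f in subset)
            if denom == 0:
                return False
            ratio = p_joint(list(subset), collocations) / denom
            if not (alpha <= ratio <= 1 / alpha):
                return False
    return True


# Hypothetical (collocation, count) pairs for one verb.
DATA = [
    ({"ga": "human", "wo": "beverage", "de": "place"}, 2),
    ({"ga": "human", "wo": "beverage"}, 2),
    ({"de": "place"}, 2),
    ({"ni": "place"}, 2),
]
```

On this toy data, the frame with "ga" and "wo" and the frame with "de" have a ratio of exactly 1 and pass the test, whereas the single-case frames for "ga" and "wo" always co-occur (ratio 2) and are rejected at $\alpha = 0.9$.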
Generation from Independent Partial 
Subcategorization Frames 
Now, as in the case of the generation from a subcategorization
frame $f$, we denote the generation of $e$
from a tuple $\langle f_1, \ldots, f_n \rangle$ of independent partial
subcategorization frames of $f$ as below:
$$ \langle f_1, \ldots, f_n \rangle \rightarrow e $$
3.2 The Ambiguity of Case Dependencies 
This section describes the problem of the ambiguity 
of case dependencies when observing verb-noun collo- 
cation in corpus. This problem is caused by the fact 
that, only by observing each verb-noun collocation in 
corpus, it is not decidable which cases are dependent 
on each other and which cases are optional and inde- 
pendent of other cases. 
Consider the following example:
Example 1 
Kodomo-ga kouen-de juusu-wo nomu. 
child-NOM park-at juice-ACC drink 
(A child drinks juice at the park.) 
The verb-noun collocation is represented as a feature 
structure e below: 
$$ e = \begin{bmatrix} pred : nomu \\ ga : c_c \\ wo : c_j \\ de : c_p \end{bmatrix} $$
In this feature structure $e$, $c_c$, $c_p$, and $c_j$ represent
the leaf classes (in the thesaurus) of the nouns
"kodomo(child)", "kouen(park)", and "juusu(juice)".
Next, we assume that the concepts "human", "place",
and "beverage" are superordinate to "kodomo(child)",
"kouen(park)", and "juusu(juice)", respectively,
and introduce the corresponding classes $c_{hum}$, $c_{plc}$,
and $c_{bev}$. Then, the following superordinate-subordinate
relations hold:
$$ c_c \preceq_c c_{hum}, \quad c_p \preceq_c c_{plc}, \quad c_j \preceq_c c_{bev} $$
Allowing these superordinate classes as sense restric- 
tion in subcategorization frames, let us consider the 
several patterns of subcategorization frames which can 
generate the verb-noun collocation e. Those patterns 
of subcategorization frames vary according to the de- 
pendencies of cases within them. 
If the three cases "ga(NOM)", "wo(ACC)", and 
"de(at)" are dependent on each other and it is not 
possible to find any division into a tuple of several in- 
dependent partial subcategorization frames, e can be 
regarded as generated from a subcategorization frame 
containing all of the three cases: 
$$ \begin{bmatrix} pred : nomu \\ ga : c_{hum} \\ wo : c_{bev} \\ de : c_{plc} \end{bmatrix} \rightarrow e \qquad (9) $$
Otherwise, if only the two cases "ga(NOM)" and "wo(ACC)" 
are dependent on each other and the "de(at)" 
case is independent of those two cases, e can 
be regarded as generated from the following tuple of 
independent partial subcategorization frames: 
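$$ \left\langle \begin{bmatrix} pred : nomu \\ ga : c_{hum} \\ wo : c_{bev} \end{bmatrix}, \begin{bmatrix} pred : nomu \\ de : c_{plc} \end{bmatrix} \right\rangle \rightarrow e \qquad (10) $$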
Otherwise, if all the three cases "ga(NOM)", "wo(ACC)", 
and "de(at)" are independent of each 
other, e can be regarded as generated from the fol- 
lowing tuple of independent partial subcategorization 
frames, each of which contains only one case: 
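$$ \left\langle \begin{bmatrix} pred : nomu \\ ga : c_{hum} \end{bmatrix}, \begin{bmatrix} pred : nomu \\ wo : c_{bev} \end{bmatrix}, \begin{bmatrix} pred : nomu \\ de : c_{plc} \end{bmatrix} \right\rangle \rightarrow e \qquad (11) $$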
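The patterns (9) to (11) are three of the possible ways of grouping the case-markers of $e$ into mutually independent blocks. As an illustrative sketch (the function name `partitions` is ours, and the independence test itself is not applied here), all such groupings can be enumerated as the set partitions of the case-markers:

```python
def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or let it form a block of its own.
        yield [[first]] + part


# Candidate dependency patterns for the cases of Example 1.
patterns = list(partitions(["ga", "wo", "de"]))
```

For three case-markers this yields five groupings, including the single block of (9), the grouping {ga, wo} / {de} of (10), and the three singleton blocks of (11).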
3.3 The Ambiguity of Noun Class 
Generalization 
This section describes the problem of the ambiguity 
of noun class generalization when observing verb-noun 
collocation in corpus. This problem is caused by the 
fact that, only by observing each verb-noun colloca- 
tion in corpus, it is not decidable which superordinate 
class generates each observed leaf class in the verb- 
noun collocation. 
For example, let us again consider Example 1. 
We assume that the concepts "animal" and "liquid" 
are superordinate to "human" and "beverage", re- 
spectively, and introduce the corresponding classes 
$c_{ani}$ and $c_{liq}$. Then, the following superordinate-subordinate
relations hold:
$$ c_{hum} \preceq_c c_{ani}, \quad c_{bev} \preceq_c c_{liq} $$
If we additionally allow these superordinate classes 
as sense restriction in subcategorization frames, we 
can consider several additional patterns of subcate- 
gorization frames which can generate the verb-noun 
collocation e, along with those patterns described in 
the previous section. 
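Suppose that only the two cases "ga(NOM)" and "wo(ACC)"
are dependent on each other and the "de(at)"
case is independent of those two cases as in
the formula (10). Since the leaf class $c_c$ ("child") can
be generated from either $c_{hum}$ or $c_{ani}$, and also the
leaf class $c_j$ ("juice") can be generated from either
$c_{bev}$ or $c_{liq}$, $e$ can be regarded as generated according
to any of the four formulas (10) and (12)-(14):
$$ \left\langle \begin{bmatrix} pred : nomu \\ ga : c_{ani} \\ wo : c_{bev} \end{bmatrix}, \begin{bmatrix} pred : nomu \\ de : c_{plc} \end{bmatrix} \right\rangle \rightarrow e \qquad (12) $$
$$ \left\langle \begin{bmatrix} pred : nomu \\ ga : c_{hum} \\ wo : c_{liq} \end{bmatrix}, \begin{bmatrix} pred : nomu \\ de : c_{plc} \end{bmatrix} \right\rangle \rightarrow e \qquad (13) $$
$$ \left\langle \begin{bmatrix} pred : nomu \\ ga : c_{ani} \\ wo : c_{liq} \end{bmatrix}, \begin{bmatrix} pred : nomu \\ de : c_{plc} \end{bmatrix} \right\rangle \rightarrow e \qquad (14) $$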
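The four alternatives (10) and (12)-(14) arise from choosing one candidate class per case. A small illustrative sketch of this enumeration follows; the `CANDIDATES` table simply restates the classes named in Example 1, and the helper name `generalizations` is an assumption of this sketch.

```python
from itertools import product

# Candidate sense restrictions for the dependent cases of Example 1.
CANDIDATES = {"ga": ["human", "animal"], "wo": ["beverage", "liquid"]}


def generalizations(candidates):
    """Yield one {case: class} frame per combination of candidate classes."""
    cases = list(candidates)
    for classes in product(*(candidates[c] for c in cases)):
        yield dict(zip(cases, classes))


# Each frame is paired with the independent frame {"de": "place"}.
frames = list(generalizations(CANDIDATES))
```

The first frame produced corresponds to (10), and the remaining three correspond to (13), (12), and (14).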
3.4 A Model of Generating Verb-Noun 
Collocation 
When observing each verb-noun collocation e, as we 
described in the previous two sections, the ambiguities 
of case dependencies and noun class generalization re- 
main, and it is necessary to consider every possible 
tuple of independent partial subcategorization frames 
which can generate the observed verb-noun collocation 
$e$. In order to cope with these ambiguities, we introduce
two sets: one is a set $F$ of tuples $\langle f_1, \ldots, f_n \rangle$
of independent partial subcategorization frames and
the other is a set $E$ of verb-noun collocations $e$. The
generation of a verb-noun collocation from a tuple of
independent partial subcategorization frames can be
regarded as a mapping $\pi$ from $F$ to $E$:
$$ \pi : F \rightarrow E \qquad (15) $$
Usually, for each given verb-noun collocation in
$E$, there exist several possible tuples of independent
partial subcategorization frames in $F$. Thus, $\pi$ is a
many-to-one mapping. The mapping from a tuple
$\langle f_1, \ldots, f_n \rangle$ of independent partial subcategorization
frames to a verb-noun collocation $e$ can be denoted
also as follows:
$$ \langle f_1, \ldots, f_n \rangle \rightarrow e \qquad (16) $$
When observing a verb-noun collocation $e$, we assume
this many-to-one mapping $\pi$ and consider every
possible tuple of independent partial subcategorization
frames which can generate $e$, according to the
ambiguities of case dependencies and noun class gen- 
eralization. 
3.5 Parameters of Generating Verb-Noun 
Collocation 
Before we give definitions of subcategorization prefer- 
ence functions in the next section, we introduce the 
parameter $q(f_k \mid v)$ of generating verb-noun collocation,
which is used in the calculation of the subcategorization
preference. The parameter $q(f_k \mid v)$ can be
regarded as the conditional probability of the partial
subcategorization frame $f_k$ and could be estimated in
a similar way to $p(f \mid v)$ in the formula (5).
However, it is a parameter of generating verb-noun
collocation and has to be estimated so as to maximize
the subcategorization preference function for the
training corpus.
One solution of this parameter estimation process 
might be to regard the model of generating verb-noun 
collocation as a probabilistic model and then to apply 
the maximum likelihood estimation method. When 
estimating the parameters from the training sample, 
we have to note that each verb-noun collocation is 
ambiguous since it could be interpreted in several different
ways according to case dependencies and optimal
noun class generalization levels. As for parameter
estimation of probabilistic models from ambiguous
training samples, the EM algorithm (Baum, 1972) is a
well-known solution and has been studied for years.
In the EM algorithm, parameters are assigned to events,
and it is required that parameters sum up to 1. However,
since two subcategorization frames could have
the same case and a subsumption relation could hold
between their sense restrictions, they may overlap
and the requirement that parameters sum up to
1 is not satisfiable. Therefore, it is not straightforward
to apply the EM algorithm to the task of parameter
estimation of generating verb-noun collocation.
Instead of introducing a probabilistic model of generating
verb-noun collocation,$^4$ in this paper, we employ
a more general framework which is applicable to
various measures of subcategorization preference,
including the probability of generating verb-noun collocation.
In this framework, the process of parameter
estimation is regarded as a general optimization problem
of maximizing the subcategorization preference
function for the training corpus.
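In order to describe the framework, first we introduce
the probability $p(\langle f_1, \ldots, f_n \rangle_j \rightarrow e_i \mid e_i)$ of generating
a verb-noun collocation $e_i$ in the set $E$ from a
tuple $\langle f_1, \ldots, f_n \rangle_j$ in the set $F$, given $e_i$, and denote
it as a conditional probability $p(\langle f_1, \ldots, f_n \rangle_j \mid e_i)$.
Then, for each $e_i$ in $E$, we can consider a probability
distribution $p(\langle f_1, \ldots, f_n \rangle_j \mid e_i)$ over the set $F$ of tuples
of independent partial subcategorization frames:
$$ \begin{array}{cc|ccc} & & & E & \\ & & e_1 & \cdots \; e_i \; \cdots & \\ \hline & \langle f_1, \ldots, f_{n'} \rangle_1 & & \vdots & \\ F & \vdots & \cdots & p(\langle f_1, \ldots, f_n \rangle_j \mid e_i) & \cdots \\ & \langle f_1, \ldots, f_{n''} \rangle_m & & \vdots & \end{array} $$
Each probability distribution $p(\langle f_1, \ldots, f_n \rangle_j \mid e_i)$ satisfies
the following axiom of the probability:
$$ \sum_j p(\langle f_1, \ldots, f_n \rangle_j \mid e_i) = 1 \quad \text{for all } i $$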
According to the probability distribution $p(\langle f_1, \ldots, f_n \rangle_j \mid e_i)$
of generating $e_i$ from $\langle f_1, \ldots, f_n \rangle_j$, we estimate
the frequency of the subcategorization frame $f_k$
and then estimate the parameter $q(f_k \mid v)$ as below:
$$ q(f_k \mid v) \approx \frac{freq(f_k)}{freq(v)} = \frac{\sum_{i,j} 1 \cdot p(\langle f_1, \ldots, f_k, \ldots, f_n \rangle_j \mid e_i)}{freq(v)} \qquad (17) $$
When learning probabilistic subcategorization preference
(section 5), we estimate the probability distribution
$p(\langle f_1, \ldots, f_n \rangle_j \mid e_i)$ for each $e_i$ so as to maximize
the subcategorization preference function for $e_i$.
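The estimate (17) amounts to accumulating, for each partial frame $f_k$, the probability mass of every tuple that contains it. The sketch below illustrates this under assumptions of our own: frames are represented by opaque hashable labels, each element of `distributions` is the distribution for one observed collocation $e_i$ of a fixed verb $v$, and $freq(v)$ is taken to be the number of such collocations.

```python
from collections import defaultdict


def estimate_q(distributions):
    """Estimate q(f_k | v) in the manner of (17).

    distributions: one dict per observed collocation e_i of verb v,
    mapping a tuple of partial frames to p(tuple | e_i).
    """
    freq_v = len(distributions)  # assumed: one entry per collocation token
    freq = defaultdict(float)
    for dist in distributions:
        for frames, prob in dist.items():
            for f in frames:
                freq[f] += prob  # the term 1 * p(<..., f_k, ...>_j | e_i)
    return {f: n / freq_v for f, n in freq.items()}


# Hypothetical frames A, B, C and two observed collocations.
q = estimate_q([
    {("A",): 0.5, ("B", "C"): 0.5},
    {("B",): 1.0},
])
```

In this toy case, frame B receives mass 0.5 from the first collocation and 1.0 from the second, giving $q(B \mid v) = 0.75$.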
4 Subcategorization Preference 
Functions 
This section introduces a function $\phi$ which measures
the subcategorization preference when generating a
verb-noun collocation $e$ from a tuple $\langle f_1, \ldots, f_n \rangle$ of
independent partial subcategorization frames:
$$ \phi(\langle f_1, \ldots, f_n \rangle \rightarrow e) \qquad (18) $$
In this paper, we introduce a subcategorization preference
function which is based on the idea of Kullback
Leibler distance.$^5$
4 Another alternative for solving the problem of learning
probabilistic subcategorization preference based on a
probabilistic model is to regard the problem as the
construction of probabilistic models from the training sample.
We will discuss this issue in section 7.
5 In Utsuro and Matsumoto (1997), we defined another
subcategorization preference function $\phi_p$ which is based-
4.1 Nominal Parts of (Partial) 
Subcategorization Frames 
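First, let $f_p, f_{p1}, \ldots, f_{pn}$ be the nominal parts of
(partial) subcategorization frames $f, f_1, \ldots, f_n$ in the
equations (2) and (4), respectively:
$$ f_p = \begin{bmatrix} p_1 : c_{1f} \\ \vdots \\ p_k : c_{kf} \end{bmatrix} $$
$$ f_p = f_{p1} \wedge \cdots \wedge f_{pn} $$
$$ f_{pi} = \begin{bmatrix} \vdots \\ p_{ij} : c_{ij} \\ \vdots \end{bmatrix}, \quad \forall j \forall j' \ p_{ij} \neq p_{i'j'} \quad (i, i' = 1, \ldots, n, \ i \neq i') $$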
As in the case of the parameters $q(f_i \mid v)$ of $f_i$
given the verb $v$, we estimate the probability $p(f_{pi})$
of the nominal part $f_{pi}$ in the whole corpus and call it
the parameter $q(f_{pi})$ of $f_{pi}$ in the whole training corpus.
We estimate the frequency of a nominal part $f_{pk}$ throughout the
whole training corpus and then estimate the parameter
$q(f_{pk})$ of $f_{pk}$ as below:
$$ q(f_{pk}) \approx \frac{\sum freq(f_k)}{N} = \frac{\sum_{i,j} 1 \cdot p(\langle f_1, \ldots, f_k, \ldots, f_n \rangle_j \mid e_i)}{N} \qquad (19) $$
4.2 $\phi_{kl}$: Kullback Leibler Distance
Rather than the simple conditional probability, this 
preference function is intended to measure the 
information-theoretic association of the verb v and the 
nominal part of the subcategorization frame. 
The Kullback Leibler (KL) distance is a measure
of the distance between two probability distributions.
Given a random variable $X$ and two probability distributions
$p(X)$ and $q(X)$, the KL distance $D(p \| q)$ of
$p(X)$ and $q(X)$ is defined as below (Cover and Thomas,
1991), where each term can be regarded as the distance
of two probabilities $p(x)$ and $q(x)$ of an event $x$:
$$ D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} $$
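As a small numerical illustration of this definition, the sum can be computed directly. The sketch assumes the natural logarithm (the base is not fixed by the definition above), distributions given as dictionaries over the same events, and the usual convention that terms with $p(x) = 0$ contribute nothing.

```python
from math import log


def kl_distance(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)).

    p and q map events to probabilities; q(x) must be positive
    wherever p(x) is positive.
    """
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
d = kl_distance(p, q)
```

Each summand `px * log(px / q[x])` is the per-event term that the preference function of this section builds on.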
In order to apply the idea of the KL distance to measuring
the association of the verb $v$ and the nominal
part $f_p$ of $f$, we introduce a random variable $F_p$ which
takes $f_p$ as its value. We also introduce the probability
distribution $p(F_p)$ of $F_p$ and the conditional probability
distribution $p(F_p \mid v)$ of $F_p$ given the verb $v$.
Then, the KL distance of $p(F_p \mid v)$ and $p(F_p)$ is
denoted as $D(p(F_p \mid v) \| p(F_p))$ and each term of it
can be regarded as the distance of two probabilities
$p(f_p \mid v)$ and $p(f_p)$. We assume that the larger this
distance is, the stronger the association of $f_p$ and $v$
is, and measure the association of $f_p$ and $v$ with this
on the probability of generating the verb-noun collocation
and described experimental results of applying $\phi_p$ to the
task of learning probabilistic subcategorization.
distance of the two probabilities $p(f_p \mid v)$ and $p(f_p)$.
With this idea, the subcategorization preference function
$\phi_{kl}$ is now formally defined as below:$^{6,7}$
$$ \phi_{kl}(\langle f_1, \ldots, f_n \rangle \rightarrow e) = p(f_p \mid v) \log \frac{p(f_p \mid v)}{p(f_p)} \qquad (20) $$
$$ \approx \prod_{i=1}^{n} p(f_{pi} \mid v) \times \log \frac{\prod_{i=1}^{n} p(f_{pi} \mid v)}{\prod_{i=1}^{n} p(f_{pi})} \qquad (21) $$
$$ \approx \prod_{i=1}^{n} q(f_{pi} \mid v) \times \log \frac{\prod_{i=1}^{n} q(f_{pi} \mid v)}{\prod_{i=1}^{n} q(f_{pi})} \qquad (22) $$
(21) is derived from the independence of the partial
subcategorization frames $f_1, \ldots, f_n$. In (22), we use
the parameters $q(f_{pi} \mid v)$ and $q(f_{pi})$ as an approximation
of the probabilities $p(f_{pi} \mid v)$ and $p(f_{pi})$.
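Given the parameter values for the partial frames of a tuple, the approximation (22) is a product followed by a log-ratio. The sketch below illustrates the arithmetic only; the function name `phi_kl`, the list-based interface, and the choice of the natural logarithm are assumptions of this sketch.

```python
from math import log, prod


def phi_kl(q_given_v, q_marginal):
    """Approximate preference value in the manner of (22).

    q_given_v:  [q(f_p1 | v), ..., q(f_pn | v)]
    q_marginal: [q(f_p1), ..., q(f_pn)]  (all assumed positive)
    """
    joint = prod(q_given_v)
    return joint * log(joint / prod(q_marginal))


# Hypothetical values for a tuple of two independent partial frames.
score = phi_kl([0.4, 0.5], [0.1, 0.2])
```

The value is zero when the frames are exactly as likely with the verb as without it, and grows as the verb-conditional parameters exceed the corpus-wide ones.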
5 Learning Probabilistic 
Subcategorization Preference 
The problem of learning subcategorization preference
can be formalized as an optimization problem of estimating
the probability distribution $p(\langle f_1, \ldots, f_n \rangle_j \mid e_i)$
(in section 3.5) of generating $e_i$ from $\langle f_1, \ldots, f_n \rangle_j$
(and then the parameters $q(f_{pk} \mid v)$ and $q(f_{pk})$) so as
to maximize the value of the subcategorization preference
function for the whole training corpus. In
this paper, we give only an approximate solution to
this problem: we estimate the probability distribution
$p(\langle f_1, \ldots, f_n \rangle_j \mid e_i)$ for each $e_i$ so as to maximize
the value of the subcategorization preference function
only for $e_i$, not for the whole training corpus.
5.1 Problem Setting 
Let the training corpus $C$ be the set of verb-noun collocations
$e$. We define the subcategorization preference
$\phi(e)$ of a verb-noun collocation $e$ as the maximum of
the subcategorization preference function $\phi$ (the formula
(18)) of generating $e$ from a tuple $\langle f_1, \ldots, f_n \rangle$.
$$ \phi(e) = \max_{\langle f_1, \ldots, f_n \rangle} \phi(\langle f_1, \ldots, f_n \rangle \rightarrow e) \qquad (23) $$
Now, the problem of learning probabilistic subcategorization
preference is stated as:
for every verb-noun collocation $e$ in $C$, estimating
the probability distribution $p(\langle f_1, \ldots, f_n \rangle_j \mid e)$
of generating $e$ from $\langle f_1, \ldots, f_n \rangle_j$,
under the constraint that the value
of the subcategorization preference $\phi(e)$ is maximized.
6 Resnik (1993) applies the idea of the KL distance to
measuring the association of a verb $v$ and its object noun
class $c$. Our definition of $\phi_{kl}$ corresponds to an extension of
Resnik's association score, which considers dependencies of
more than one case-marker in a subcategorization frame.
7 Another related measure is Dunning (1993)'s likelihood
ratio tests for binomial and multinomial distributions,
which are claimed to be effective even with very
much smaller volumes of text than is necessary for other
tests based on assumed normal distributions.
5.2 Learning Algorithm 
First, we identify independent partial subcategorization
frames according to the condition of (8). Then,
let $E(v)$ be the set of verb-noun collocations containing
the verb $v$ in the training corpus $C$. Let $F(e)$ be the
set of tuples $\langle f_1, \ldots, f_n \rangle$ of independent partial
subcategorization frames which can generate $e$ and satisfy
the independence condition of (8).$^8$
$$ F(e) = \{ \langle f_1, \ldots, f_n \rangle \mid \langle f_1, \ldots, f_n \rangle \rightarrow e \} \qquad (24) $$
$F(e)$ contains a tuple $\langle f \rangle$ consisting of only one
subcategorization frame $f$ only if $f$ cannot be divided
into several independent partial subcategorization
frames.
Then, we assume that each element of $F(e)$ occurs
evenly and estimate the initial conditional probability
distribution $p(\langle f_1, \ldots, f_n \rangle_j \mid e)$ of generating $e$ from
$\langle f_1, \ldots, f_n \rangle_j$ as an approximation below:
$$ p(\langle f_1, \ldots, f_n \rangle_j \mid e) \approx \frac{1}{|F(e)|} \qquad (25) $$
5.2.1 Approximate Estimation of 
Verb-Independent Parameters 
Using the initial conditional probability distribution
of $p(\langle f_1, \ldots, f_n \rangle_j \mid e)$ as in the formula (25), the initial
values of the verb-independent parameters $q(f_{pk})$
are estimated by the formula (19). In the current
implementation of the learning algorithm, we use these
initial values as approximate estimation of those
verb-independent parameters and probabilities throughout
the learning process.
5.2.2 Iterative Reestimation of 
Verb-Dependent Parameters 
Verb-dependent parameters $q(f_k \mid v)$ ($= q(f_{pk} \mid v)$)
are iteratively estimated so as to maximize the subcategorization
preference $\phi(e)$ for every verb-noun collocation
$e$ in the training corpus $C$. As a learning
algorithm, we employ the following stingy algorithm:
1. Initialization 
As with the case of the verb-independent parameters,
for each verb-noun collocation $e$ in $C$, the set
$F(e)$ is initially constructed according to the definition
in (24). Then, the initial conditional probability
distribution of $p(\langle f_1, \ldots, f_n \rangle_j \mid e)$ and the initial values
of the verb-dependent parameters $q(f_k \mid v)$ are
estimated as (25) and (17), respectively.
8 In the current implementation, we deal with sense ambiguities
of case-marked nouns and case ambiguities of
Japanese topic-marking post-positional particles such as "ha(TOPIC)", "mo(ALSO)", and "dake(ONLY)".
When
constructing the set $F(e)$, we consider all the possible combinations
of senses of semantically ambiguous nouns and
cases of topic-marking post-positional particles. These
ambiguities can be resolved by maximizing the subcategorization
preference function (section 5.2.2).
Table 1: The Result of Learning Probabilistic Subcategorization
Preference for "kau(buy, incur)" ($\phi_{kl}$, $\alpha = 0.9$)
     $\hat{F}(e)$                                                  $\phi_{kl}$   Egs.
 1   [wo(ACC):14(Products)]                                        1.88   158
 2   [wo(ACC):13721-8(kabu(stock))]                                0.27    15
 3   [ga(NOM):12(Human)]                                           0.27    40
 4   [wo(ACC):15(Nature)]                                          0.21    25
 5   [kara(from):12(Human)]                                        0.19    14
 6   [de(at):12(Shop,Place)]                                       0.17    18
 7   [ga(NOM):12(Human), wo(ACC):13721-8(kabu(stock))]             0.16     6
 8   [wo(ACC):13010(hukyou(disgust))]                              0.12     6
 9   [wo(ACC):11961-1(Currency)]                                   0.10     6
10   [ga(NOM):12(Human), wo(ACC):1456(Musical Instruments)]        0.09     4
     (11th-150th)                                                  --     196
2. Iterative Reestimation
The subcategorization preference $\phi(e)$ is maximized
by repeatedly searching the set $F(e)$ for tuples
$\langle f_1, \ldots, f_n \rangle$ which give the maximum subcategorization
preference and removing other tuples from $F(e)$.
The following two steps are repeated until the values
of the parameters $q(f_k \mid v)$ converge.
(2a) For each verb-noun collocation $e$ in $C$, set $\hat{F}(e)$
as the set of tuples $\langle f_1, \ldots, f_n \rangle$ of independent
partial subcategorization frames which can generate
$e$ and give the maximum subcategorization
preference in the equation (23).
(2b) Set the values of the conditional probabilities
$p(\langle f_1, \ldots, f_n \rangle_j \mid e)$ as below and the parameters
$q(f_k \mid v)$ as (17), respectively:
$$ p(\langle f_1, \ldots, f_n \rangle_j \mid e) = \frac{1}{|\hat{F}(e)|} $$
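The initialization and the steps (2a) and (2b) can be sketched for a single verb as follows. This is an illustrative sketch under several assumptions of our own, not the implementation used in the experiments: frames are opaque hashable labels, the candidate sets $F(e)$ are given in advance (already filtered by the independence condition), the verb-independent parameters `q_all` are fixed and positive, $freq(v)$ is the number of collocations, and the names `estimate_q`, `phi_kl`, and `learn` are hypothetical.

```python
from collections import defaultdict
from math import log, prod


def estimate_q(candidates):
    """q(f | v) as in (17), with p(tuple | e) uniform over each set."""
    freq = defaultdict(float)
    for tuples in candidates:
        for frames in tuples:
            for f in frames:
                freq[f] += 1.0 / len(tuples)
    return {f: n / len(candidates) for f, n in freq.items()}


def phi_kl(frames, q_v, q_all):
    """Preference of one tuple, following the approximation (22)."""
    joint = prod(q_v[f] for f in frames)
    return joint * log(joint / prod(q_all[f] for f in frames))


def learn(candidates, q_all, max_iter=100, tol=1e-9):
    """Prune each F(e) to its best-scoring tuples until q(f | v) settles."""
    candidates = [list(tuples) for tuples in candidates]
    q_v = estimate_q(candidates)                      # initialization
    for _ in range(max_iter):
        pruned = []
        for tuples in candidates:                     # step (2a)
            scores = [phi_kl(t, q_v, q_all) for t in tuples]
            best = max(scores)
            pruned.append([t for t, s in zip(tuples, scores)
                           if s >= best - tol])
        new_q = estimate_q(pruned)                    # step (2b)
        converged = (new_q.keys() == q_v.keys() and
                     all(abs(new_q[f] - q_v[f]) < tol for f in new_q))
        candidates, q_v = pruned, new_q
        if converged:
            break
    return candidates, q_v


# Two collocations of one verb; each may come from the joint frame "A"
# or from the pair of independent frames "G" and "W" (all hypothetical).
final_sets, q = learn(
    [[("A",), ("G", "W")], [("A",), ("G", "W")]],
    q_all={"A": 0.1, "G": 0.5, "W": 0.5},
)
```

In this toy run the joint frame "A" scores higher in the first pass, so the tuple of single-case frames is removed and $q(A \mid v)$ settles at 1.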
6 Experiments and Evaluation 
6.1 Corpus and Thesaurus 
As the training and test corpus, we used the EDR 
Japanese bracketed corpus (EDR, 1995), which contains
about 210,000 sentences collected from newspaper
per and magazine articles. From the EDR corpus, 
we extracted 153,014 verb-noun collocations of 835 
verbs which appear more than 50 times in the cor- 
pus. These verb-noun collocations contain about 270 
case-markers. We constructed the training set C from 
these 153,014 verb-noun collocations. 
We used 'Bunrui Goi Hyou'(BGH) (NLRI, 1993) 
as the Japanese thesaurus. BGH has a six-layered 
abstraction hierarchy and more than 60,000 words are 
assigned at the leaves and its nominal part contains 
about 45,000 words. Five classes are allocated at the 
next level from the root node. 
6.2 Experiments and Results 
From the training set $C$, we first estimated the values
of verb-independent parameters as in section 5.2.1,
and then iteratively reestimated verb-dependent parameters
of the subcategorization preference function
$\phi_{kl}$ for 10 verbs as in section 5.2.2. For each of the
10 verbs, the number of verb-noun collocations is
between 100 and 500. We ran experiments with the independence
parameter $\alpha$ = 0.5/0.7/0.9. In the iterative reestimation
procedure, the values of the verb-dependent
parameters converged after 2 to 5 iterations.
For the 10 verbs, about 75% of the verb-noun
collocations have only one case-marked noun. The rate
at which tuples of partial subcategorization frames are
judged as independent increases as the value of the
independence parameter $\alpha$ decreases: it rises
from 1.4% ($\alpha = 0.9$) to 12.1% ($\alpha = 0.5$).
As an example, for the verb "kau (buy, incur)", Table 1
shows the sets $F(e)$ of tuples of independent
partial subcategorization frames which give the maximum
subcategorization preference. The table lists the sets
$F(e)$ with the 10 highest preference values of $\phi_{kl}$, along
with the numbers (the column 'Egs.') of verb-noun
collocations which are judged as generated from each
$F(e)$.⁹ Since about 75% of the verb-noun
collocations have only one case-marked noun, most
of the 10 high-scored sets have only one case-marked
noun. However, the 10 high-scored sets cover about
60% of the verb-noun collocations in the training set,
and they can be regarded as typical subcategorization
frames of the verb "kau (buy, incur)".
6.3 Evaluation of Subcategorization 
Preference 
We evaluate the performance of the estimated param- 
eters of the subcategorization preference as follows. 
Suppose that the following word sequence represents
a verb-final Japanese sentence with a subordinate
clause, where $N_x, \ldots, N_{2k}$ are nouns, $p_x, \ldots, p_{2k}$
are case-marking post-positional particles, $v_1, v_2$ are
verbs, and the first verb $v_1$ is the head verb of the
subordinate clause.
$N_x$-$p_x$-$N_{11}$-$p_{11}$-$\cdots$-$N_{1l}$-$p_{1l}$-$v_1$-$N_{21}$-$p_{21}$-$\cdots$-$N_{2k}$-$p_{2k}$-$v_2$
We consider the subcategorization ambiguity of the
post-positional phrase $N_x$-$p_x$: i.e., whether $N_x$-$p_x$ is
subcategorized by $v_1$ or $v_2$.
We use held-out verb-noun collocations of the verbs
$v_1$ and $v_2$ which are not used in the training. They
are like the verb-noun collocations $e_{c1}$ and $e_{c2}$ on
the left side below. Next, we generate erroneous
verb-noun collocations $e_{e1}$ of $v_1$ and $e_{e2}$ of $v_2$, like those on the
right side below, by choosing a case element $p_x : N_x$ at
random and moving it from $v_1$ to $v_2$.
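$$
e_{c1} = \left[\begin{array}{l} pred : v_1 \\ p_{11} : N_{11} \\ \quad\vdots \\ p_{1l} : N_{1l} \\ p_x : N_x \end{array}\right]
\qquad
e_{e1} = \left[\begin{array}{l} pred : v_1 \\ p_{11} : N_{11} \\ \quad\vdots \\ p_{1l} : N_{1l} \end{array}\right]
$$
$$
e_{c2} = \left[\begin{array}{l} pred : v_2 \\ p_{21} : N_{21} \\ \quad\vdots \\ p_{2k} : N_{2k} \end{array}\right]
\qquad
e_{e2} = \left[\begin{array}{l} pred : v_2 \\ p_{21} : N_{21} \\ \quad\vdots \\ p_{2k} : N_{2k} \\ p_x : N_x \end{array}\right]
$$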
⁹ In each subcategorization frame, Japanese noun
classes of the BGH thesaurus are represented as numerical
codes, in which each digit denotes the choice of the branch
in the thesaurus.
Table 2: Accuracies of Subcategorization Preference with $\phi_{kl}$ (%)

                      Independent                   Any
                 $\alpha$=0.5  $\alpha$=0.9  $\alpha$=0.5  $\alpha$=0.9
Optimal    +         81.7          70.7          65.8          68.6
           -          2.2           3.3          27.1           6.0
Initial    +         16.1          25.6           7.1          25.0
           -          0             0.4           0             0.4
Accuracy             97.8          96.3          72.9          93.6
Applicability        83.9          74.0          92.9          74.6
Then, we compare the sum $\phi(e_{c1}) + \phi(e_{c2})$ of the
maximums (in the definition (23)) of $\phi_{kl}$ for the correct
pair with the sum $\phi(e_{e1}) + \phi(e_{e2})$ of those for the
erroneous pair, and calculate the rate at which the correct
pair has the greater value.
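The evaluation procedure described in this section can be sketched as follows. The sketch is an illustration under stated assumptions, not the implementation used in the experiments: the class `Collocation`, the functions `make_erroneous_pair` and `preference_accuracy`, and the argument `phi` (any function returning the preference value of a collocation) are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Collocation:
    pred: str                           # the verb
    cases: Tuple[Tuple[str, str], ...]  # (case particle, noun) pairs

def make_erroneous_pair(e_c1, e_c2, rng):
    """Move one randomly chosen case element from v1's collocation to v2's.

    Assumes e_c1 has at least one case element.
    """
    i = rng.randrange(len(e_c1.cases))
    moved = e_c1.cases[i]
    e_e1 = Collocation(e_c1.pred, e_c1.cases[:i] + e_c1.cases[i + 1:])
    e_e2 = Collocation(e_c2.pred, e_c2.cases + (moved,))
    return e_e1, e_e2

def preference_accuracy(phi, correct_pairs, seed=0):
    """Rate at which the correct pair scores strictly higher than the
    erroneous pair generated from it."""
    rng = random.Random(seed)
    wins = 0
    for e_c1, e_c2 in correct_pairs:
        e_e1, e_e2 = make_erroneous_pair(e_c1, e_c2, rng)
        if phi(e_c1) + phi(e_c2) > phi(e_e1) + phi(e_e2):
            wins += 1
    return wins / len(correct_pairs)
```

Passing the preference function as an argument keeps the comparison independent of how the parameters were estimated, so the same loop can be run with optimized or with initial parameter values.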
For the purpose of evaluating the effectiveness 
of factors of learning probabilistic subcategorization 
preference, we perform experiments with different set- 
tings and compare their results. The following two 
options are examined: 
* Whether the subcategorization preference func- 
tion uses tuples of partial subcategorization 
frames judged as independent ("Independent"), or 
any tuples ("Any"). 
* The independence parameter $\alpha$ = 0.5 or 0.9.
For three Japanese verbs "kau (buy, incur)", "nomu
(drink)", and "kasaneru (pile up, repeat)", we
extracted pairs of correct verb-noun collocations and
evaluated the performance of subcategorization
preference. Table 2 gives the results averaged over the
extracted pairs, including the accuracies of
subcategorization preference. The rows "Optimal" and
"Initial" distinguish the parameter values used for the
decision: the initial values of the parameters are used
instead of the optimized values (section 5.2.2) when the
subcategorization preference function is not applicable
to the given verb-noun collocation and returns zero.
The line "Accuracy" lists the sums of the "Optimal" and
"Initial" accuracies (the "+" rows), while the line
"Applicability" lists the percentages of positive
values of the subcategorization preference function
with optimized parameters.
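The relation between the rows of Table 2 can be checked with a few lines of arithmetic. The values below are copied from Table 2; the dictionary layout and the function names are assumptions made for this illustration only.

```python
# Percentages copied from Table 2, keyed by (tuple setting, alpha).
table2 = {
    ("Independent", 0.5): {"opt+": 81.7, "opt-": 2.2, "init+": 16.1, "init-": 0.0},
    ("Independent", 0.9): {"opt+": 70.7, "opt-": 3.3, "init+": 25.6, "init-": 0.4},
    ("Any", 0.5): {"opt+": 65.8, "opt-": 27.1, "init+": 7.1, "init-": 0.0},
    ("Any", 0.9): {"opt+": 68.6, "opt-": 6.0, "init+": 25.0, "init-": 0.4},
}

def accuracy(col):
    # "Accuracy": sum of the "+" rows of "Optimal" and "Initial".
    return round(col["opt+"] + col["init+"], 1)

def applicability(col):
    # "Applicability": share of cases decided with optimized parameters,
    # i.e. both "Optimal" rows together.
    return round(col["opt+"] + col["opt-"], 1)

for setting, col in table2.items():
    print(setting, accuracy(col), applicability(col))
```

For every column the two derived values reproduce the "Accuracy" and "Applicability" lines of the table, and the four "Optimal"/"Initial" entries sum to 100%.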
It is natural that the settings with weaker conditions
on the independence judgment of partial
subcategorization frames result in higher applicabilities.
The setting with independent tuples of partial
subcategorization frames achieves higher accuracy than that
with any tuples, and this result indicates that the result
of the independence judgment is effective when applying
the estimated parameters to the task of
subcategorization preference. Even in the case of the setting
with any tuples, the setting with $\alpha = 0.5$ gives poorer
accuracy than that with $\alpha = 0.9$. In this case, the
difference of the independence parameter $\alpha$ affects only the
parameter estimation stage. This result indicates that
the independence judgment process is effective also
when estimating parameters from the training corpus.
7 Conclusion 
This paper proposed a novel method of learning 
probabilistic subcategorization preference of verbs. 
We described a part of the results of the exper- 
iments on learning probabilistic subcategorization 
preference from the EDR Japanese bracketed cor- 
pus, as well as those on evaluating the performance 
of subcategorization preference. Although the scale 
of the evaluation experiment was relatively small, 
we achieved accuracies higher than 96%. The de- 
tails of the experimental results are available in 
Utsuro and Matsumoto (1997). As we mentioned in 
section 3.5, probabilistic model construction methods
might also be applicable to the task of learning
probabilistic subcategorization preference. We have
already applied the maximum entropy method (Pietra,
Pietra, and Lafferty, 1995; Berger, Pietra, and Pietra,
1996) to this task (Utsuro, Miyata, and Matsumoto,
1997) and are also planning to evaluate the effectiveness
of the MDL principle (Rissanen, 1989) when combined
with the maximum entropy method. Their results
will be compared with those of the method proposed
in this paper and reported in the near future.

References 

Baum, L. E. 1972. An inequality and associated maximization
technique in statistical estimation for probabilistic functions of
Markov processes. Inequalities, 3:1-8.

Berger, A. L., S. A. D. Pietra, and V. J. D. Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

Black, E. 1993. Towards history-based grammars: Using richer 
models for probabilistic parsing. In Proceedings of the 31st Annual Meeting of ACL, pages 31-37. 

Collins, M. 1996. A new statistical parser based on bigram lexical
dependencies. In Proceedings of the 34th Annual Meeting of
ACL, pages 184-191.

Cover, T. M. and J. A. Thomas. 1991. Elements of Information 
Theory. John Wiley and Sons, Inc. 

Dunning, T. 1993. Accurate methods for the statistics of surprise 
and coincidence. Computational Linguistics, 19(1):61-74. 

EDR, (Japan Electronic Dictionary Research Institute, Ltd), 1995. 
EDR Electronic Dictionary Technical Guide. 

Haruno, M. 1995. Verbal case frame acquisition as data compression. In Proceedings of the 5th International Workshop on Natural Language Understanding and Logic Programming. 

Li, H. and N. Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of International 
Conference on Recent Advances in Natural Language Processing, pages 239-248. 

Li, H. and N. Abe. 1996. Learning dependencies between case
frame slots. In Proceedings of the 16th COLING, pages 10-15.

Magerman, D. M. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of ACL, pages
276-283.

NLRI, (National Language Research Institute), 1993. Word List 
by Semantic Principles. Syuei Syuppan. (in Japanese). 

Pietra, S. D., V. D. Pietra, and J. Lafferty. 1995. Inducing fea- 
tures of random fields. CMU Technical Report CMU-CS-95-144, 
School of Computer Science, Carnegie Mellon University. 

Resnik, P. 1993. Semantic classes and syntactic ambiguity. In Proceedings of the Human Language Technology Workshop, pages 278-283. 

Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry, 
volume 15 of Series in Computer Science. World Scientific Pub- 
lishing Company. 

Utsuro, T. and Y. Matsumoto. 1997. Learning proba- 
bilistic subcategorization preference and its application to 
syntactic disambiguation. Information Science Technical 
Report NAIST-IS-TR97006, Nara Institute of Science and 
Technology. 

Utsuro, T., T. Miyata, and Y. Matsumoto. 1997. Maximum entropy parameter learning of subcategorization preference. (submitted to the 35th Annual Meeting of ACL). 
