Learning Dependencies between Case Frame Slots 
Hang Li and Naoki Abe 
Theory NEC Laboratory, RWCP* 
c/o C&C Research Laboratories, NEC
4-1-1 Miyazaki Miyamae-ku, Kawasaki, 216 Japan
{lihang,abe}@sbl.cl.nec.co.jp
Abstract 
We address the problem of automati- 
cally acquiring case frame patterns (se- 
lectional patterns) from large corpus 
data. In particular, we propose a method
of learning dependencies between case 
frame slots. We view the problem of 
learning case frame patterns as that 
of learning a multi-dimensional discrete 
joint distribution, where random variables
represent case slots. We then formalize
the dependencies between case
slots as the probabilistic dependencies
between these random variables. Since 
the number of parameters in a multi- 
dimensional joint distribution is expo- 
nential in general, it is infeasible to ac- 
curately estimate them in practice. To 
overcome this difficulty, we settle for
approximating the target joint distribu- 
tion by the product of low order com- 
ponent distributions, based on corpus 
data. In particular we propose to employ 
an efficient learning algorithm based on 
the MDL principle to realize this task. 
Our experimental results indicate that 
for certain classes of verbs, the accuracy 
achieved in a disambiguation experiment 
is improved by using the acquired knowl- 
edge of dependencies. 
1 Introduction 
We address the problem of automatically acquir- 
ing case frame patterns (selectional patterns) from 
large corpus data. The acquisition of case frame 
patterns normally involves the following three 
subproblems: 1) Extracting case frames from corpus
data, 2) Generalizing case frame slots within
these case frames, 3) Learning dependencies that 
exist between these generalized case frame slots. 
* Real World Computing Partnership

In this paper, we propose a method of learning
dependencies between case frame slots. By
'dependency' is meant the relation that exists
between case frame slots which constrains the
possible values assumed by each of those slots. As
illustrative examples, consider the following
sentences.
The girl will fly a jet. 
This airline company flies many jets.
The girl will fly Japan Airlines.
*The airline company will fly Japan Airlines.
(1)
We see that an 'airline company' can be the subject
of verb 'fly' (the value of case slot 'arg1'),
when the direct object (the value of case slot
'arg2') is an 'airplane' but not when it is an
'airline company'¹. These examples indicate that the
possible values of case slots depend in general on
those of the other case slots: that is, there exist
'dependencies' between different case slots. The
knowledge of such dependencies is useful in various
tasks in natural language processing, especially
in analysis of sentences involving multiple
prepositional phrases, such as
The girl will fly a jet from Tokyo to Beijing. (2)
Note in the above example that the case slot of
'from' and that of 'to' should be considered
dependent and the attachment site of one of the
prepositional phrases (case slots) can be determined by
that of the other with high accuracy and confidence.
There has been no method proposed to date, 
however, that learns dependencies between case 
frame slots in the natural language processing lit- 
erature. In the past research, the distributional 
pattern of each case slot is learned independently, 
1 One may argue that 'fly' has different word senses 
in these sentences and for each of these word senses 
there is no dependency between the case frames. Word 
senses are in general difficult to define precisely, how- 
ever, and in language processing, they would have 
to be disambiguated from the context anyway, which
is essentially equivalent to assuming that the depen- 
dencies between case slots exist. Thus, our proposed 
method can in effect 'discover' implicit word senses 
from corpus data.
and methods of resolving ambiguity are also based
on the assumption that case slots are independent
(Hindle and Rooth, 1991), or dependencies between
at most two case slots are considered (Brill
and Resnik, 1994). Thus, provision of an effective
method of learning dependencies between case
slots, as well as investigation of the usefulness of
the acquired dependencies in disambiguation and
other natural language processing tasks would be
an important contribution to the field.
In this paper, we view the problem of learning
case frame patterns as that of learning a
multi-dimensional discrete joint distribution, where
random variables represent case slots. We then
formalize the dependencies between case slots as the
probabilistic dependencies between these random
variables. Since the number of dependencies that
exist in a multi-dimensional joint distribution is
exponential if we allow n-ary dependencies in general,
it is infeasible to estimate them with high
accuracy with a data size available in
practice. It is also clear that relatively few of these
random variables (case slots) are actually dependent
on each other with any significance. Thus it
is likely that the target joint distribution can be
approximated reasonably well by the product of
component distributions of low order, drastically
reducing the number of parameters that need to
be considered. This is indeed the approach we
take in this paper.
Now the problem is how to approximate a joint
distribution by the product of lower order
component distributions. Recently, (Suzuki, 1993)
proposed an algorithm to approximately learn a
multi-dimensional joint distribution expressible as
a 'dendroid distribution', which is both efficient
and theoretically sound. We employ Suzuki's
algorithm to learn case frame patterns as dendroid
distributions. We conducted some experiments to
automatically acquire case frame patterns from
the Penn Tree Bank bracketed corpus. Our
experimental results indicate that for some classes of
verbs the accuracy achieved in a disambiguation
experiment can be improved by using the acquired
knowledge of dependencies between case slots.
2 Probability Models for Case 
Frame Patterns 
Suppose that we have data given by instances of
the case frame of a verb automatically extracted
from a corpus, using conventional techniques. As
explained in the Introduction, the problem of learning
case frame patterns can be viewed as that of
estimating the underlying multi-dimensional joint
distribution which gives rise to such data. In
this research, we assume that case frame instances
with the same head are generated by a joint
distribution of type,
$P_Y(X_1, X_2, \ldots, X_n)$,   (3)
where index $Y$ stands for the head, and each of the
random variables $X_i$, $i = 1, 2, \ldots, n$, represents a
case slot. In this paper, we use 'case slots' to mean
surface case slots, and we uniformly treat obligatory
cases and optional cases. Thus the number
$n$ of the random variables is roughly equal to the
number of prepositions in English (and less than
100). These models can be further classified into
three types of probability models according to the
type of values each random variable $X_i$ assumes².
When $X_i$ assumes a word or a special symbol '0'
as its value, we refer to the corresponding model
$P_Y(X_1, \ldots, X_n)$ as a 'word-based model.' Here '0'
indicates the absence of the case slot in question.
When $X_i$ assumes a word-class or '0' as its value,
the corresponding model is called a 'class-based
model.' When $X_i$ takes on 1 or 0 as its value,
we call the model a 'slot-based model.' Here the
value of '1' indicates the presence of the case slot
in question, and '0' absence. Suppose for simplicity
that there are only 4 possible case slots
(random variables) corresponding respectively to
the subject, direct object, 'from' phrase, and 'to'
phrase. Then,
$P_{fly}(X_{arg1} = \text{girl}, X_{arg2} = \text{jet}, X_{from} = 0, X_{to} = 0)$   (4)
is given a specific probability value by a word-based
model. In contrast,
$P_{fly}(X_{arg1} = \langle\text{person}\rangle, X_{arg2} = \langle\text{airplane}\rangle, X_{from} = 0, X_{to} = 0)$   (5)
is given a specific probability by a class-based
model, where $\langle\text{person}\rangle$ and $\langle\text{airplane}\rangle$ denote word
classes. Finally,
$P_{fly}(X_{arg1} = 1, X_{arg2} = 1, X_{from} = 0, X_{to} = 0)$   (6)
is assigned a specific probability by a slot-based
model.
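The three model types differ only in the values each slot variable may take. The following is a minimal sketch of how one case frame instance could be encoded under each of them. The slot inventory `SLOTS`, the toy class map `WORD_CLASS`, and the function names are assumptions made for illustration; they are not part of the method described here.

```python
# Illustrative only: four case slots, as in the simplified example in the text.
SLOTS = ("arg1", "arg2", "from", "to")

# Hypothetical word-to-class map; a real one would come from a thesaurus.
WORD_CLASS = {"girl": "<person>", "jet": "<airplane>"}

def word_based(frame):
    """Each slot takes its head word, or 0 when the slot is absent."""
    return tuple(frame.get(slot, 0) for slot in SLOTS)

def class_based(frame):
    """Each slot takes the class of its head word, or 0 when absent."""
    return tuple(WORD_CLASS[frame[slot]] if slot in frame else 0
                 for slot in SLOTS)

def slot_based(frame):
    """Each slot takes 1 when present and 0 when absent."""
    return tuple(1 if slot in frame else 0 for slot in SLOTS)

# "The girl will fly a jet": subject and direct object only.
frame = {"arg1": "girl", "arg2": "jet"}
print(word_based(frame))
print(class_based(frame))
print(slot_based(frame))
```

The three tuples correspond to the events whose probabilities appear in (4), (5) and (6) respectively.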
We then formulate the dependencies between
case slots as the probabilistic dependencies
between the random variables in each of these three
models. In the absence of any constraints, however,
the number of parameters in each of the
above three models is exponential (even the
slot-based model has $O(2^n)$ parameters), and thus it
is infeasible to accurately estimate them in practice.
A simplifying assumption that is often made
to deal with this difficulty is that random variables
(case slots) are mutually independent.
Suppose for example that in the analysis of the
sentence
I saw a girl with a telescope, (7)
two interpretations are obtained. We wish to select
the more appropriate of the two interpretations.
A heuristic word-based method for disambiguation,
in which the slots are assumed to be
dependent, is to calculate the following values of
word-based likelihood and to select the interpretation
corresponding to the higher likelihood value.
$P_{see}(X_{arg1} = \text{I}, X_{arg2} = \text{girl}, X_{with} = \text{telescope})$   (8)
$P_{see}(X_{arg1} = \text{I}, X_{arg2} = \text{girl}) \times P_{girl}(X_{with} = \text{telescope})$   (9)

2 A representation of a probability distribution is
usually called a probability model, or simply a model.
If on the other hand we assume that the random
variables are independent, we only need to
calculate and compare $P_{see}(X_{with} = \text{telescope})$
and $P_{girl}(X_{with} = \text{telescope})$ (c.f. (Li and Abe,
1995)). The independence assumption can also
be made in the case of a class-based model or a
slot-based model. For slot-based models, with the
independence assumption, $P_{see}(X_{with} = 1)$ and
$P_{girl}(X_{with} = 1)$ are to be compared (c.f. (Hindle
and Rooth, 1991)).
Assuming that random variables (case slots)
are mutually independent would drastically reduce
the number of parameters. (Note that under
the independence assumption the number of
parameters in a slot-based model becomes $O(n)$.)
As illustrated in Section 1, this assumption is not
necessarily valid in practice. What seems to be
true in practice is that some case slots are in fact
dependent but the overwhelming majority of them are
independent, due partly to the fact that usually
only a few slots are obligatory and most others
are optional.³ Thus the target joint distribution
is likely to be approximable by the product of
several component distributions of low order, and
thus have in fact a reasonably small number of
parameters. We are thus led to the approach
of approximating the target joint distribution by
such a simplified model, based on corpus data.
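The gap between the two extremes discussed in this section can be made concrete with a small count of free parameters. The sketch below assumes every slot takes the same number of values `k` (2 for a slot-based model); the function names are illustrative.

```python
def full_joint_params(n, k=2):
    """Free parameters of an unconstrained joint over n slots, k values each."""
    return k ** n - 1

def independent_params(n, k=2):
    """Free parameters when every slot is modelled independently."""
    return n * (k - 1)

# Slot-based models (k = 2) for a few slot counts.
for n in (4, 10, 20):
    print(n, full_joint_params(n), independent_params(n))
```

With only 20 binary slots the unconstrained joint already has over a million free parameters, while the independent model has 20, which is the contrast between $O(2^n)$ and $O(n)$ drawn above.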
3 Approximation by Dendroid 
Distribution 
Without loss of generality, any n-dimensional joint
distribution can be written as
$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_{m_i} \mid X_{m_1}, \ldots, X_{m_{i-1}})$   (10)
for some permutation $(m_1, m_2, \ldots, m_n)$ of $1, 2, \ldots, n$,
where we let $P(X_{m_1} \mid X_{m_0})$ denote $P(X_{m_1})$.
A plausible assumption on the dependencies between
random variables is intuitively that each
variable directly depends on at most one other
variable. (Note that this assumption is the simplest
among those that relax the independence
assumption.) For example, if a joint distribution
$P(X_1, X_2, X_3)$ over 3 random variables $X_1, X_2, X_3$
can be written (approximated) as follows, it
(approximately) satisfies such an assumption.
$P(X_1, X_2, X_3) = (\approx)\ P(X_1) \cdot P(X_2 \mid X_1) \cdot P(X_3 \mid X_1)$   (11)

3 Optional slots are not necessarily independent,
but if two optional slots are randomly selected, it is
likely that they are independent of one another.
Such distributions are referred to as 'dendroid
distributions' in the literature. A dendroid distribution
can be represented by a dependency forest
(i.e. a set of dependency trees), whose nodes represent
the random variables, and whose directed
arcs represent the dependencies that exist between
these random variables, each labeled with a number
of parameters specifying the probabilistic dependency.
(A dendroid distribution can also be
considered as a restricted form of the Bayesian
Network (Pearl, 1988).) It is not difficult to see
that there are 7 and only 7 such representations
for the joint distribution $P(X_1, X_2, X_3)$ disregarding
the actual numerical values of the probability
parameters.
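The count of 7 can be checked by brute force. The sketch below enumerates every set of undirected edges over the variables and keeps those without a cycle. Treating the arcs as undirected is an assumption made here, on the grounds that reversing an arc within a tree does not change the family of distributions it can express. The function names are illustrative.

```python
from itertools import combinations

def is_forest(n_nodes, edges):
    """Return True when the undirected edge set contains no cycle."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # this edge would close a cycle
        parent[root_a] = root_b
    return True

def dependency_forests(n_nodes):
    """List every acyclic edge set over n_nodes labelled variables."""
    all_edges = list(combinations(range(n_nodes), 2))
    forests = []
    for size in range(len(all_edges) + 1):
        for edges in combinations(all_edges, size):
            if is_forest(n_nodes, edges):
                forests.append(edges)
    return forests

for forest in dependency_forests(3):
    print(forest)
```

For three variables this prints the empty forest (full independence), three single-edge forests and three two-edge trees. The full triangle is rejected because it contains a cycle.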
Now we turn to the problem of how to select the
best dendroid distribution from among all possible
ones to approximate a target joint distribution
based on input data generated by it. This problem
has been investigated in the area of machine
learning and related fields. A classical method is
Chow & Liu's algorithm for estimating a
multi-dimensional joint distribution as a dependency
tree, in a way which is both efficient and
theoretically sound (Chow and Liu, 1968). More
recently (Suzuki, 1993) extended their algorithm so
that it estimates the target joint distribution as
a dependency forest or 'dendroid distribution',
allowing for the possibility of learning one group
of random variables to be completely independent
of another. Since many of the random variables
(case slots) in case frame patterns are essentially
independent, this feature is crucial in our context,
and we thus employ Suzuki's algorithm for learning
our case frame patterns. Figure 1 shows the
detail of this algorithm, where $k_i$ denotes the number
of possible values assumed by node (random
variable) $X_i$, $N$ the input data size, and 'log'
denotes the logarithm to the base 2. It is easy to
see that the number of parameters in a dendroid
distribution is of the order $O(k^2 n^2)$, where $k$ is
the maximum of all $k_i$, and $n$ is the number of
random variables, and the time complexity of the
algorithm is of the same order, as it is linear in
the number of parameters.
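The procedure in Figure 1 can be sketched in a few lines of Python. This is an illustrative reading of the figure, not the authors' implementation: it assumes the data are given as a list of equal-length tuples, estimates mutual information from relative frequencies, and uses the threshold $(k_i - 1)(k_j - 1)\log N / (2N)$. The names `mutual_information` and `learn_dendroid` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information (base 2) between columns i and j."""
    n = len(data)
    count_i = Counter(row[i] for row in data)
    count_j = Counter(row[j] for row in data)
    count_ij = Counter((row[i], row[j]) for row in data)
    return sum((c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
               for (a, b), c in count_ij.items())

def learn_dendroid(data, k):
    """Return the edges of the estimated dependency forest.

    data: list of tuples, one value per variable.
    k[i]: number of possible values of variable i.
    """
    n_vars, n = len(k), len(data)
    # Queue Q: node pairs in descending order of mutual information.
    pairs = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)),
                   reverse=True)
    component = list(range(n_vars))  # label of the set containing each node
    edges = []
    for mi, i, j in pairs:
        theta = (k[i] - 1) * (k[j] - 1) * math.log2(n) / (2 * n)
        if mi <= theta:
            break  # the loop in Figure 1 ends once the best pair fails the test
        if component[i] != component[j]:
            # Merge the two sets and link the nodes.
            old, new = component[j], component[i]
            component = [new if c == old else c for c in component]
            edges.append((i, j))
    return edges

# Toy slot-based data: slots 0 and 1 always co-occur, slot 2 is unrelated.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 25
print(learn_dendroid(data, k=[2, 2, 2]))
```

On this toy sample only the pair (0, 1) has mutual information above its threshold, so the estimated forest links those two slots and leaves the third independent.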
Suzuki's algorithm is derived from the Minimum
Description Length (MDL) principle (Rissanen,
1989) which is a principle for statistical estimation
in information theory. It is known that
as a method of estimation, MDL is guaranteed
to be near optimal⁴. In applying MDL, we usually
assume that the given data are generated by a
probability model that belongs to a certain class of
models and select a model within the class which
4 We refer the interested reader to (Li and Abe,
1995) for an introduction to MDL.
Let $T := \emptyset$; Calculate the mutual information
$I(X_i, X_j)$ for all node pairs $(X_i, X_j)$; Sort the
node pairs in descending order of $I$, and store
them into queue $Q$; Let $V$ be the set of $\{X_i\}$,
$i = 1, 2, \ldots, n$;
while The maximum value of $I$ in $Q$ satisfies
$I(X_i, X_j) > \theta(X_i, X_j) = (k_i - 1)(k_j - 1) \frac{\log N}{2N}$
do begin
    Remove the node pair $(X_i, X_j)$ having the
    maximum value of $I$ from $Q$;
    If $X_i$ and $X_j$ belong to different sets $W_1$,
    $W_2$ in $V$;
    Then Replace $W_1$ and $W_2$ in $V$ with
    $W_1 \cup W_2$, and add edge $(X_i, X_j)$ to $T$;
end
Output $T$ as the set of edges of the estimated
model.
Figure 1: The learning algorithm
best explains the data. It tends to be the case
usually that a simpler model has a poorer fit to
the data, and a more complex model has a better
fit to the data. Thus there is a trade-off between
the simplicity of a model and the goodness of fit to
data. MDL resolves this trade-off in a disciplined
way: It selects a model which is reasonably simple
and fits the data satisfactorily as well. In our
current problem, a simple model means a model
with fewer dependencies, and thus MDL provides
a theoretically sound way to learn only those
dependencies that are statistically significant in the
given data. An especially interesting feature of
MDL is that it incorporates the input data size
in its model selection criterion. This is reflected,
in our case, in the derivation of the threshold $\theta$.
Note that when we do not have enough data (i.e.
for small $N$), the thresholds will be large and
few nodes tend to be linked, resulting in a simple
model in which most of the case frame slots
are judged independent. This is reasonable since
with a small data size most case slots cannot be
determined to be dependent with any significance.
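The effect of the data size on the threshold can be seen numerically. The sketch below evaluates $\theta = (k_i - 1)(k_j - 1)\log N / (2N)$ with the base 2 logarithm for a few illustrative values of $N$ and $k$. The function name `mdl_threshold` and the chosen values are assumptions made for illustration.

```python
import math

def mdl_threshold(k_i, k_j, n):
    """Threshold that the mutual information of a pair must exceed."""
    return (k_i - 1) * (k_j - 1) * math.log2(n) / (2 * n)

# Slot-based variables (2 values each) versus hypothetical class-based
# variables with 10 values each, at increasing data sizes.
for n in (10, 100, 1000):
    print(n, round(mdl_threshold(2, 2, n), 4), round(mdl_threshold(10, 10, n), 4))
```

For these sizes the threshold shrinks as $N$ grows, and at any fixed $N$ it is 81 times larger for the 10-valued pair than for the binary pair, so variables with many values need far more data before a dependency can be accepted.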
4 Experimental Results
We conducted some preliminary experiments to
test the performance of the proposed method as a
method of acquiring case frame patterns. In
particular, we tested to see how effective the patterns
acquired by our method are in structural
disambiguation. We will describe the results of this
experimentation in this section.
4.1 Experiment 1: Slot-based Model
In our first experiment, we tried to acquire slot-based
case frame patterns. First, we extracted
181,250 case frames from the Wall Street Journal
(WSJ) bracketed corpus of the Penn Tree Bank
as training data. There were 357 verbs for which
more than 50 case frame examples appeared in the
training data.

Table 1: Verbs and their perplexity
Verb        Independent   Dendroid
add          5.82          5.36
buy          5.04          4.98
find         2.07          1.92
open        20.56         16.53
protect      3.39          3.13
provide      4.46          4.13
represent    1.26          1.26
send         3.20          3.29
succeed      2.97          2.57
tell         1.36          1.36
First we acquired the slot-based case frame patterns
for all of the 357 verbs. We then conducted a
ten-fold cross validation to evaluate the 'test data
perplexity' of the acquired case frame patterns,
that is, we used nine tenths of the case frames for
each verb as training data (saving what remains
as test data), to acquire case frame patterns, and
then calculated perplexity using the test data. We
repeated this process ten times and calculated the
average perplexity. Table 1 shows the average
perplexity obtained for some randomly selected verbs.
We also calculated the average perplexity of the
'independent slot models' acquired based on the
assumption that each slot is independent. Our
experimental results shown in Table 1 indicate that
the use of the dendroid models can achieve up to
20% perplexity reduction as compared to the
independent slot models. It seems safe to say therefore
that the dendroid model is more suitable for
representing the true model of case frames than the
independent slot model.
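The evaluation measure can be sketched as follows. The exact formula is not spelled out in the text, so this assumes the usual definition of test data perplexity, 2 raised to the negative average base 2 log probability that the model assigns to the test case frames. The function names are illustrative.

```python
import math

def perplexity(probs):
    """Test data perplexity from the model probability of each test instance."""
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))

def relative_reduction(independent, dendroid):
    """Fraction by which the dendroid perplexity falls below the independent one."""
    return (independent - dendroid) / independent

# A model that gives every test frame probability 1/4 has perplexity 4.
print(perplexity([0.25, 0.25]))

# Average perplexities reported for 'open' in Table 1.
print(round(relative_reduction(20.56, 16.53), 3))
```

The second figure is the largest reduction among the verbs listed in Table 1, and corresponds to the "up to 20%" stated above.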
We also used the acquired dependency knowledge
in a pp-attachment disambiguation experiment.
We used the case frames of all 357 verbs
as our training data. We used the entire bracketed
corpus as training data in part because we
wanted to utilize as many training data as possible.
We extracted (verb, noun1, prep, noun2) or
(verb, prep1, noun1, prep2, noun2) patterns from
the WSJ tagged corpus as test data, using pattern
matching techniques. We took care to ensure
that only the part of the tagged (non-bracketed)
corpus which does not overlap with the bracketed
corpus is used as test data. (The bracketed corpus
does overlap with part of the tagged corpus.)
We acquired case frame patterns using the
training data. We found that there were 266
verbs, whose 'arg2' slot is dependent on some
of the other preposition slots. There were 37
(See examples in Table 2) verbs whose dependency
between arg2 and other slots is positive
and exceeds a certain threshold, i.e. P(arg2 = 1,
prep = 1) > 0.25. The dependencies found
by our method seem to agree with human intuition
in most cases. There were 93 examples in
the test data ((verb, noun1, prep, noun2) pattern)
in which the two slots 'arg2' and prep of verb
are determined to be positively dependent and
their dependencies are stronger than the threshold
of 0.25. We forcibly attached prep noun2 to
verb for these 93 examples. For comparison, we
also tested the disambiguation method based on
the independence assumption proposed by (Li and
Abe, 1995) on these examples. Table 3 shows
the results of these experiments, where 'Dendroid'
stands for the former method and 'Independent'
the latter. We see that using the information on
dependency we can significantly improve the
disambiguation accuracy on this part of the data.

Table 2: Verbs and their dependent slots
Verb      Dependent slots
add       arg2 to
blame     arg2 for
buy       arg2 for
climb     arg2 from
compare   arg2 with
convert   arg2 to
defend    arg2 against
explain   arg2 to
file      arg2 against
focus     arg2 on

Table 3: Disambiguation results 1
Method        Accuracy (%)
Dendroid      90/93 (96.8)
Independent   79/93 (84.9)
Since we can use existing methods to perform
disambiguation for the rest of the data, we
can improve the disambiguation accuracy for the
entire test data using this knowledge. Furthermore,
we found that there were 140 verbs having
inter-dependent preposition slots. There were
22 (See examples in Table 4) out of these 140
verbs such that their case slots have positive
dependency that exceeds a certain threshold, i.e.
P(prep1 = 1, prep2 = 1) > 0.25. Again the
dependencies found by our method seem to agree
with human intuition. In the test data (which
are of the (verb, prep1, noun1, prep2, noun2) pattern),
there were 21 examples that involve one of the
above 22 verbs whose preposition slots show
dependency exceeding 0.25. We forcibly attached
both prep1 noun1 and prep2 noun2 to verb on
these 21 examples, since the two slots prep1 and
prep2 are judged to be dependent. Table 5 shows
the results of this experimentation, where 'Dendroid'
and 'Independent' respectively represent
the method of using and not using the knowledge
of dependencies. Again, we found that for
the part of the test data in which dependency is
present, the dependency knowledge can
be used to improve the accuracy of a disambiguation
method, although our experimental results
are inconclusive at this stage.

Table 4: Verbs and their dependent slots
Head      Dependent slots
acquire   from for
apply     for to
boost     from to
climb     from to
fall      from to
grow      from to
improve   from to
raise     from to
sell      to for
think     of as

Table 5: Disambiguation results 2
Method        Accuracy (%)
Dendroid      21/21 (100)
Independent   20/21 (95.2)
4.2 Experiment 2: Class-based Model 
We also used the 357 verbs and their case frames 
used in Experiment 1 to acquire class-based case 
frame patterns using the proposed method. We 
randomly selected 100 verbs among these 357
verbs and attempted to acquire their case frame 
patterns. We generalized the case slots within 
each of these case frames using the method pro- 
posed by (Li and Abe, 1995) to obtain class-based 
case slots, and then replaced the word-based case 
slots in the data with the obtained class-based 
case slots. What resulted are class-based case 
frame examples. We used these data as input to 
the learning algorithm and acquired case frame 
patterns for each of the 100 verbs. We found that
no two case slots are determined as dependent in 
any of the case frame patterns. This is because 
the number of parameters in a class-based model
is very large compared to the size of the data we 
had available. 
Our experimental result verifies the validity in
practice of the assumption widely made in statistical
natural language processing that class-based
case slots (and also word-based case slots) are
mutually independent, at least when the data size
available is that provided by the current version
of the Penn Tree Bank. This is an empirical finding
that is worth noting, since up to now the
independence assumption was based solely on human
intuition, to the best of our knowledge. To
test how large a data size is required to estimate
a class-based model, we conducted the following
experiment. We defined an artificial class-based
model and generated some data according to its
distribution. We then used the data to estimate
a class-based model (dendroid distribution), and
evaluated the estimated model by measuring the
number of dependencies (dependency arcs) it has
and the KL distance between the estimated model
and the true model. We repeatedly generated data
and observed the learning 'curve', namely the
relationship between the number of dependencies in
the estimated model and the data size used in
estimation, and the relationship between the KL
distance between the estimated and true models and
the data size. We defined two other models and
conducted the same experiments. Figure 2 shows
the results of these experiments for these three
artificial models averaged over 10 trials. (The number
of parameters in Model1, Model2, and Model3
are 18, 30, and 44 respectively, while the number
of dependencies are 1, 3, and 5 respectively.) We
see that to accurately estimate a model the data
size required is as large as 100 times the number
of parameters. Since a class-based model tends to
have more than 100 parameters usually, the current
data size available in the Penn Tree Bank is
not enough for accurate estimation of the
dependencies within case frames of most verbs.

Figure 2: (a) Number of dependencies versus data size and (b) KL distance versus data size
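The KL distance used in this evaluation can be computed as in the following sketch. It assumes both models are available as explicit tables mapping each joint outcome to its probability, and it measures the divergence of the estimated model from the true one in bits. The direction of the divergence and the toy distributions are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL distance D(p || q) in bits; q must be positive wherever p is."""
    return sum(pv * math.log2(pv / q[x]) for x, pv in p.items() if pv > 0)

# Hypothetical true and estimated distributions over two outcomes.
true_model = {"a": 0.5, "b": 0.5}
estimated = {"a": 0.25, "b": 0.75}
print(round(kl_divergence(true_model, estimated), 4))
print(kl_divergence(true_model, true_model))
```

The distance is zero only when the estimated model matches the true one, which is why it falls toward zero in Figure 2(b) as the data size grows. By the rule of thumb stated above, Model1 with 18 parameters would need on the order of 1,800 examples.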
5 Conclusions 
We conclude this paper with the following re- 
marks. 
1. The primary contribution of research reported
in this paper is that we have proposed
a method of learning dependencies between
case frame slots, which is theoretically sound
and efficient, thus providing an effective tool
for acquiring case dependency information.
2. For the slot-based model, sometimes case
slots are found to be dependent. Experimental
results demonstrate that using the dependency
information, when dependency does
exist, structural disambiguation results can
be improved.
3. For the word-based or class-based models,
case slots are judged independent, with the
data size currently available in the Penn Tree
Bank. This empirical finding verifies the
independence assumption widely made in practice
in statistical natural language processing.
We proposed to use dependency forests to represent
case frame patterns. It is possible that more
complicated probabilistic dependency graphs like
Bayesian networks would be more appropriate for
representing case frame patterns. This would require
even more data and thus the problem of
how to collect sufficient data would be a crucial
issue, in addition to the methodology of learning
case frame patterns as probabilistic dependency
graphs. Finally the problem of how to determine
obligatory/optional cases based on dependencies
(acquired from data) should also be addressed.
References 
Eric Brill and Philip Resnik. 1994. A rule-based
approach to prepositional phrase attachment
disambiguation. Proceedings of the 15th COLING,
pages 1198-1204.
C.K. Chow and C.N. Liu. 1968. Approximating
discrete probability distributions with dependence
trees. IEEE Transactions on Information
Theory, 14(3):462-467.
Donald Hindle and Mats Rooth. 1991. Structural
ambiguity and lexical relations. Proceedings of
the 29th ACL, pages 229-236.
Hang Li and Naoki Abe. 1995. Generalizing case
frames using a thesaurus and the MDL principle.
Proceedings of Recent Advances in Natural
Language Processing, pages 239-248.
Judea Pearl. 1988. Probabilistic Reasoning in
Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann Publishers Inc.
Jorma Rissanen. 1989. Stochastic Complexity in
Statistical Inquiry. World Scientific Publishing
Co.
Joe Suzuki. 1993. A construction of bayesian networks
from databases based on an MDL principle.
Proceedings of Uncertainty in AI '92.
