<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1004">
  <Title>Learning Dependencies between Case Frame Slots</Title>
  <Section position="3" start_page="20" end_page="22" type="metho">
    <SectionTitle>
2 Probability Models for Case
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="20" end_page="22" type="sub_section">
      <SectionTitle>
Frame Patterns
</SectionTitle>
      <Paragraph position="0"> Suppose that we have data given by instances of the case frame of a verb automatically extracted from a corpus, using conventional techniques. As explained in the Introduction, the problem of learning case frame patterns can be viewed as that of estimating the underlying multi-dimensional joint distribution which gives rise to such data. In this research, we assume that case frame instances with the same head are generated by a joint distribution of type P_Y(X_1, X_2, ..., X_n), (3) where index Y stands for the head, and each of the random variables X_i, i = 1, 2, ..., n, represents a case slot. In this paper, we use 'case slots' to mean surface case slots, and we uniformly treat obligatory cases and optional cases. Thus the number n of the random variables is roughly equal to the number of prepositions in English (and less than 100). These models can be further classified into three types of probability models according to the type of values each random variable X_i assumes.</Paragraph>
      <Paragraph position="1"> When X_i assumes a word or a special symbol '0' as its value, we refer to the corresponding model P_Y(X_1, ..., X_n) as a 'word-based model.' Here '0' indicates the absence of the case slot in question.</Paragraph>
      <Paragraph position="2"> When X_i assumes a word-class or '0' as its value, the corresponding model is called a 'class-based model.' When X_i takes on 1 or 0 as its value, we call the model a 'slot-based model.' Here the value '1' indicates the presence of the case slot in question, and '0' absence. Suppose for simplicity that there are only 4 possible case slots (random variables) corresponding respectively to the subject, direct object, 'from' phrase, and 'to' phrase. Then, P_fly(X_arg1 = girl, X_arg2 = jet, X_from = 0, X_to = 0) (4) is given a specific probability value by a word-based model. In contrast,</Paragraph>
      <Paragraph position="4"> P_fly(X_arg1 = &lt;person&gt;, X_arg2 = &lt;airplane&gt;, X_from = 0, X_to = 0) (5) is given a specific probability by a class-based model, where &lt;person&gt; and &lt;airplane&gt; denote word classes. Finally,</Paragraph>
      <Paragraph position="6"> P_fly(X_arg1 = 1, X_arg2 = 1, X_from = 0, X_to = 0) (6) is assigned a specific probability by a slot-based model.</Paragraph>
      <Paragraph position="7"> We then formulate the dependencies between case slots as the probabilistic dependencies between the random variables in each of these three models. In the absence of any constraints, however, the number of parameters in each of the above three models is exponential (even the slot-based model has O(2^n) parameters), and thus it is infeasible to accurately estimate them in practice. A simplifying assumption that is often made to deal with this difficulty is that random variables (case slots) are mutually independent.</Paragraph>
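As a concrete sketch of this parameter blow-up (illustrative code, not from the paper; both function names are made up), the free-parameter counts of a full joint slot-based model and of its independence approximation can be compared directly:

```python
def full_joint_params(n: int) -> int:
    # Free parameters of the full joint over n binary slots: 2^n - 1.
    return 2 ** n - 1

def independent_params(n: int) -> int:
    # Under slot independence, one Bernoulli parameter per slot: O(n).
    return n

for n in (4, 10, 40):
    print(n, full_joint_params(n), independent_params(n))
```

Even at n = 40 (well under the roughly 100 prepositions mentioned above), the full joint table is hopelessly large while the independent model stays tiny.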
      <Paragraph position="8"> Suppose for example that in the analysis of the sentence 'I saw a girl with a telescope,' (7) two interpretations are obtained. We wish to select the more appropriate of the two interpretations. A heuristic word-based method for disambiguation, in which the slots are assumed to be dependent, is to calculate the following values of word-based likelihood and to select the interpretation corresponding to the higher likelihood value. (A representation of a probability distribution is usually called a probability model, or simply a model.)</Paragraph>
      <Paragraph position="10"> If on the other hand we assume that the random variables are independent, we only need to calculate and compare P_see(X_with = telescope) and P_girl(X_with = telescope) (cf. (Li and Abe, 1995)). The independence assumption can also be made in the case of a class-based model or a slot-based model. For slot-based models, with the independence assumption, P_see(X_with = 1) and</Paragraph>
      <Paragraph position="12"> P_girl(X_with = 1) are calculated and compared (cf. (Hindle and Rooth, 1991)).</Paragraph>
      <Paragraph position="13"> Assuming that random variables (case slots) are mutually independent would drastically reduce the number of parameters. (Note that under the independence assumption the number of parameters in a slot-based model becomes O(n).) As illustrated in Section 1, this assumption is not necessarily valid in practice. What seems to be true in practice is that some case slots are in fact dependent but the overwhelming majority of them are independent, due partly to the fact that usually only a few slots are obligatory and most others are optional. (Optional slots are not necessarily independent, but if two optional slots are randomly selected, it is likely that they are independent of one another.) Thus the target joint distribution is likely to be approximable by the product of several component distributions of low order, and thus to have in fact a reasonably small number of parameters. We are thus led to the approach of approximating the target joint distribution by such a simplified model, based on corpus data.</Paragraph>
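The independence-based likelihood comparison for the telescope example can be sketched as follows (illustrative Python, not the paper's implementation; the counts and the names `observations` and `p_with` are made up):

```python
from collections import Counter

# Sketch: word-based disambiguation of "I saw a girl with a telescope"
# under the slot-independence assumption. P_head(X_with = w) is estimated
# by maximum likelihood from hypothetical (head, with-slot) observations.
observations = {
    "see":  Counter({"telescope": 3, "0": 7}),   # '0' = slot absent
    "girl": Counter({"telescope": 1, "0": 9}),
}

def p_with(head: str, value: str) -> float:
    counts = observations[head]
    return counts[value] / sum(counts.values())

# Attach "with a telescope" to whichever head gives the higher likelihood.
verb_lik = p_with("see", "telescope")    # 0.3
noun_lik = p_with("girl", "telescope")   # 0.1
attach = "see" if verb_lik >= noun_lik else "girl"
print(attach)  # with these toy counts, the phrase attaches to the verb
```

Only two one-dimensional distributions are consulted, instead of the full joint over all slots.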
    </Section>
  </Section>
  <Section position="4" start_page="22" end_page="22" type="metho">
    <SectionTitle>
3 Approximation by Dendroid Distribution
</SectionTitle>
    <Paragraph position="0"> Without loss of generality, any n-dimensional joint distribution can be written as P(X_1, X_2, ..., X_n) = Π_{i=1}^{n} P(X_{m_i} | X_{m_1}, ..., X_{m_{i-1}})</Paragraph>
    <Paragraph position="2"> for some permutation (m_1, m_2, ..., m_n) of 1, 2, ..., n, where we let P(X_{m_1} | X_{m_0}) denote P(X_{m_1}).</Paragraph>
    <Paragraph position="3"> A plausible assumption on the dependencies between random variables is intuitively that each variable directly depends on at most one other variable. (Note that this assumption is the simplest among those that relax the independence assumption.) For example, if a joint distribution P(X_1, X_2, X_3) over 3 random variables X_1, X_2, X_3</Paragraph>
    <Paragraph position="4"> can be written (approximated) as P(X_1) P(X_2 | X_1) P(X_3), it (approximately) satisfies such an assumption.</Paragraph>
    <Paragraph position="6"> Such distributions are referred to as 'dendroid distributions' in the literature. A dendroid distribution can be represented by a dependency forest (i.e. a set of dependency trees), whose nodes represent the random variables, and whose directed arcs represent the dependencies that exist between these random variables, each labeled with a number of parameters specifying the probabilistic dependency. (A dendroid distribution can also be considered as a restricted form of the Bayesian network (Pearl, 1988).) It is not difficult to see that there are 7 and only 7 such representations for the joint distribution P(X_1, X_2, X_3), disregarding the actual numerical values of the probability parameters.</Paragraph>
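A minimal sketch of such a dendroid factorisation (invented probability tables, not from the paper): X_2 depends on X_1, while X_3 is an isolated node of the forest.

```python
# Invented probability tables: X2 depends on X1; X3 is an isolated node.
p_x1 = {0: 0.6, 1: 0.4}                      # root marginal P(X1)
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1},        # P(X2 | X1)
                 1: {0: 0.3, 1: 0.7}}
p_x3 = {0: 0.8, 1: 0.2}                      # isolated node P(X3)

def joint(x1, x2, x3):
    # P(X1, X2, X3) = P(X1) * P(X2 | X1) * P(X3)
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3[x3]

# The factorisation is a proper distribution: it sums to one.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```

Only 1 + 2 + 1 = 4 free parameters are needed here, against 7 for the unconstrained joint over three binary variables.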
    <Paragraph position="7"> Now we turn to the problem of how to select the best dendroid distribution from among all possible ones to approximate a target joint distribution based on input data generated by it. This problem has been investigated in the area of machine learning and related fields. A classical method is Chow and Liu's algorithm for estimating a multi-dimensional joint distribution as a dependency tree, in a way which is both efficient and theoretically sound (Chow and Liu, 1968). More recently, (Suzuki, 1993) extended their algorithm so that it estimates the target joint distribution as a dependency forest, or 'dendroid distribution', allowing for the possibility of learning one group of random variables to be completely independent of another. Since many of the random variables (case slots) in case frame patterns are essentially independent, this feature is crucial in our context, and we thus employ Suzuki's algorithm for learning our case frame patterns. Figure 1 shows the details of this algorithm, where k_i denotes the number of possible values assumed by node (random variable) X_i, N the input data size, and 'log' denotes the logarithm to base 2. It is easy to see that the number of parameters in a dendroid distribution is of the order O(k^2 n^2), where k is the maximum of all k_i and n is the number of random variables, and the time complexity of the algorithm is of the same order, as it is linear in the number of parameters.</Paragraph>
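A rough sketch of this style of dependency-forest learning (not the paper's Figure 1): pairs of slots are ranked by empirical mutual information and linked only when the information gain clears an MDL-style threshold and no cycle is created. The threshold below uses the commonly cited form θ_ij = (k_i − 1)(k_j − 1) log₂ N / (2N), which may differ in detail from Suzuki's formulation.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    # Empirical mutual information I(Xi; Xj) in bits.
    n = len(data)
    ci = Counter(r[i] for r in data)
    cj = Counter(r[j] for r in data)
    cij = Counter((r[i], r[j]) for r in data)
    mi = 0.0
    for (vi, vj), c in cij.items():
        p_ij = c / n
        mi += p_ij * math.log2(p_ij / ((ci[vi] / n) * (cj[vj] / n)))
    return mi

def learn_forest(data):
    # Greedily link the highest-MI pairs that clear the MDL-style threshold,
    # skipping any link that would create a cycle (union-find).
    n, dims = len(data), len(data[0])
    parent = list(range(dims))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = []
    for i, j in sorted(combinations(range(dims), 2),
                       key=lambda p: -mutual_information(data, *p)):
        ki = len({r[i] for r in data})
        kj = len({r[j] for r in data})
        theta = (ki - 1) * (kj - 1) * math.log2(n) / (2 * n)
        if mutual_information(data, i, j) > theta and find(i) != find(j):
            parent[find(i)] = find(j)
            edges.append((i, j))
    return edges

# Toy data: slots 0 and 1 always agree; slot 2 is independent noise.
data = [(1, 1, k % 2) for k in range(50)] + [(0, 0, k % 2) for k in range(50)]
print(learn_forest(data))  # only the correlated pair is linked: [(0, 1)]
```

Because the threshold grows as N shrinks, small samples yield sparse forests, which matches the behaviour discussed below.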
    <Paragraph position="8"> Suzuki's algorithm is derived from the Minimum Description Length (MDL) principle (Rissanen, 1989), which is a principle for statistical estimation in information theory. It is known that as a method of estimation, MDL is guaranteed to be near optimal. In applying MDL, we usually assume that the given data are generated by a probability model that belongs to a certain class of models and select a model within the class which best explains the data. It tends to be the case usually that a simpler model has a poorer fit to the data, and a more complex model has a better fit to the data. Thus there is a trade-off between the simplicity of a model and the goodness of fit to data. MDL resolves this trade-off in a disciplined way: it selects a model which is reasonably simple and fits the data satisfactorily as well. In our current problem, a simple model means a model with fewer dependencies, and thus MDL provides a theoretically sound way to learn only those dependencies that are statistically significant in the given data. An especially interesting feature of MDL is that it incorporates the input data size in its model selection criterion. This is reflected, in our case, in the derivation of the threshold. Note that when we do not have enough data (i.e. for small N), the thresholds will be large and few nodes tend to be linked, resulting in a simple model in which most of the case frame slots are judged independent. This is reasonable since with a small data size most case slots cannot be determined to be dependent with any significance.</Paragraph>
  </Section>
  <Section position="5" start_page="22" end_page="24" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> more than 50 case frame examples appeared in the training data.</Paragraph>
    <Paragraph position="1"> First, we acquired the slot-based case frame patterns for all of the 357 verbs. We then conducted ten-fold cross validation to evaluate the 'test data perplexity' of the acquired case frame patterns; that is, we used nine tenths of the case frames for each verb as training data (saving what remains as test data) to acquire case frame patterns, and then calculated perplexity using the test data. We repeated this process ten times and calculated the average perplexity. Table 1 shows the average perplexity obtained for some randomly selected verbs.</Paragraph>
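The held-out perplexity measure can be sketched as follows (toy model and data, not the experimental setup itself): perplexity is 2 raised to the average negative log2-probability that the trained model assigns to the held-out instances.

```python
import math

def perplexity(model_prob, test_instances):
    # Test-set perplexity: 2 ** (average negative log2-probability).
    log_sum = sum(math.log2(model_prob[x]) for x in test_instances)
    return 2 ** (-log_sum / len(test_instances))

# Toy trained model over three 'instances' (probabilities sum to 1).
model = {"a": 0.5, "b": 0.25, "c": 0.25}
held_out = ["a", "a", "b", "c"]
print(perplexity(model, held_out))  # 1.5 bits per instance -> 2**1.5 ~ 2.83
```

Lower perplexity means the model predicts the held-out case frames better, which is how the dendroid and independent-slot models are compared in Table 1.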
    <Paragraph position="2"> We also calculated the average perplexity of the 'independent slot models' acquired based on the assumption that each slot is independent. Our experimental results shown in Table 1 indicate that the use of the dendroid models can achieve up to 20% perplexity reduction as compared to the independent slot models. It seems safe to say therefore that the dendroid model is more suitable for representing the true model of case frames than the independent slot model.</Paragraph>
    <Paragraph position="3"> We also used the acquired dependency knowledge in a pp-attachment disambiguation experiment. We used the case frames of all 357 verbs as our training data. We used the entire bracketed corpus as training data in part because we wanted to utilize as many training data as possible. We extracted (verb, noun1, prep, noun2) or (verb, prep1, noun1, prep2, noun2) patterns from the WSJ tagged corpus as test data, using pattern matching techniques. We took care to ensure that only the part of the tagged (non-bracketed) corpus which does not overlap with the bracketed corpus is used as test data. (The bracketed corpus does overlap with part of the tagged corpus.) We acquired case frame patterns using the training data. We found that there were 266 verbs whose 'arg2' slot is dependent on some</Paragraph>
    <Paragraph position="4"> of the other preposition slots. There were 37 verbs (see examples in Table 2) whose dependency between 'arg2' and other slots is positive</Paragraph>
    <Paragraph position="6"> by our method seem to agree with human intuition in most cases. There were 93 examples in the test data ((verb, noun1, prep, noun2) pattern) in which the two slots 'arg2' and prep of verb are determined to be positively dependent and their dependencies are stronger than the threshold of 0.25. We forcibly attached 'prep noun2' to verb for these 93 examples. For comparison, we also tested the disambiguation method based on the independence assumption proposed by (Li and Abe, 1995) on these examples. Table 3 shows the results of these experiments, where 'Dendroid' stands for the former method and 'Independent' the latter. We see that using the information on dependency we can significantly improve the disambiguation accuracy on this part of the data. Since we can use existing methods to perform disambiguation for the rest of the data, we can improve the disambiguation accuracy for the entire test data using this knowledge. Furthermore, we found that there were 140 verbs having inter-dependent preposition slots. There were 22 (see examples in Table 4) out of these 140 verbs such that their case slots have positive dependency that exceeds a certain threshold, i.e.</Paragraph>
    <Paragraph position="8"> The dependencies found by our method seem to agree with human intuition. In the test data (which are of the (verb, prep1, noun1, prep2, noun2) pattern), there were 21 examples that involve one of the above 22 verbs whose preposition slots show dependency exceeding 0.25. We forcibly attached both 'prep1 noun1' and 'prep2 noun2' to verb on these 21 examples, since the two slots prep1 and prep2 are judged to be dependent. Table 5 shows the results of this experimentation, where 'Dendroid' and 'Independent' respectively represent the method of using and not using the knowledge of dependencies. Again, we found that for the part of the test data in which dependency is present, the use of the dependency knowledge can be used to improve the accuracy of a disambiguation method, although our experimental results are inconclusive at this stage.</Paragraph>
    <Section position="1" start_page="22" end_page="24" type="sub_section">
      <SectionTitle>
4.2 Experiment 2: Class-based Model
</SectionTitle>
      <Paragraph position="0"> We also used the 357 verbs and their case frames used in Experiment 1 to acquire class-based case frame patterns using the proposed method. We randomly selected 100 verbs among these 357 verbs and attempted to acquire their case frame patterns. We generalized the case slots within each of these case frames using the method proposed by (Li and Abe, 1995) to obtain class-based case slots, and then replaced the word-based case slots in the data with the obtained class-based case slots. What resulted are class-based case frame examples. We used these data as input to the learning algorithm and acquired case frame patterns for each of the 100 verbs. We found that no two case slots are determined as dependent in any of the case frame patterns. This is because the number of parameters in a class-based model is very large compared to the size of the data we had available.</Paragraph>
      <Paragraph position="1"> Our experimental result verifies the validity in practice of the assumption widely made in statistical natural language processing that class-based case slots (and also word-based case slots) are mutually independent, at least when the data size available is that provided by the current version of the Penn Tree Bank. This is an empirical finding that is worth noting, since up to now the independence assumption was based solely on human intuition, to the best of our knowledge. To test how large a data size is required to estimate a class-based model, we conducted the following experiment. We defined an artificial class-based model and generated some data according to its distribution. We then used the data to estimate a class-based model (dendroid distribution), and evaluated the estimated model by measuring the number of dependencies (dependency arcs) it has and the KL distance between the estimated model and the true model. We repeatedly generated data and observed the learning 'curve', namely the relationship between the number of dependencies in the estimated model and the data size used in estimation, and the relationship between the KL distance between the estimated and true models and the data size. We defined two other models and conducted the same experiments. Figure 2 shows the results of these experiments for these three artificial models averaged over 10 trials. (The numbers of parameters in Model1, Model2, and Model3 are 18, 30, and 44 respectively, while the numbers of dependencies are 1, 3, and 5 respectively.) We see that to accurately estimate a model the data size required is as large as 100 times the number of parameters. Since a class-based model tends to have more than 100 parameters usually, the current data size available in the Penn Tree Bank is not enough for accurate estimation of the dependencies within case frames of most verbs.</Paragraph>
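The KL-distance evaluation can be sketched as follows (made-up distributions over a two-bit event space, not the paper's artificial models; the sketch assumes the estimated model gives non-zero probability wherever the true model does):

```python
import math

def kl_divergence(p, q):
    # D(p || q) in bits; assumes q[x] > 0 wherever p[x] > 0.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Made-up 'true' model vs. a flat estimate over two binary slots.
true_model = {"00": 0.4, "01": 0.1, "10": 0.1, "11": 0.4}
estimated  = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}

d = kl_divergence(true_model, estimated)
print(round(d, 4))  # shrinks toward 0 as the estimate approaches the truth
```

Plotting this quantity against the training-set size gives the learning curve described above: the divergence falls toward zero only once the sample is large relative to the number of parameters.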
    </Section>
  </Section>
</Paper>