<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1029"> <Title>A Class-based Probabilistic approach to Structural Disambiguation</Title> <Section position="3" start_page="0" end_page="194" type="metho"> <SectionTitle> 2 The Input Data and Semantic Hierarchy </SectionTitle> <Paragraph position="0"> The data used to estimate the probabilities is a multiset of 'co-occurrence triples': a noun 1 We use italics when referring to words, and angled brackets for concepts. This notation does not always pick out a concept uniquely, but the context should make clear the concept being referred to.</Paragraph> <Paragraph position="1"> lemma, verb lemma, and argument position. 2 Let the universes of verbs, argument positions and nouns that can appear in the input data be denoted V = { v1, ..., vkV }, R = { r1, ..., rkR }, and N = { n1, ..., nkN },</Paragraph> <Paragraph position="3"> respectively. Such data can be obtained from a treebank, or from a shallow parser. Note that we do not distinguish between alternative senses of verbs, and assume that each instance of a noun in the data refers to exactly one concept.</Paragraph> <Paragraph position="4"> The semantic hierarchy used is the noun hypernym taxonomy of WordNet (version 1.6). 3 Let C = { c1, ..., ckC } be the set of concepts in WordNet (kC ≈ 66,000). A concept is represented in WordNet by a synset: a set of synonymous words which can be used to denote that concept. For example, the concept ⟨cocaine⟩, as in the drug, is represented by the following synset: {cocaine, cocain, coke, snow, C}. Let syn(c) ⊆ N be the synset for the concept c, and let cn(n) = { c | n ∈ syn(c) } be the
set of concepts that can be denoted by the noun n.</Paragraph> <Paragraph position="5"> The hierarchy has the structure of a directed acyclic graph, although the number of nodes in the graph with more than one parent is only around one percent of the total. The edges in the graph form what we call the direct-isa relation (direct-isa ⊆ C × C). Let isa = direct-isa*, the transitive, reflexive closure of direct-isa, so that (c', c) ∈ isa implies that c is a hypernym of c'; and let c̄ = { c' | (c', c) ∈ isa } be the set consisting of the concept c and all of its hyponyms. Thus, the set c̄ with c = ⟨food⟩ contains all the concepts which are kinds of food, including ⟨food⟩ itself.</Paragraph> <Paragraph position="6"> Note that words in the data can appear in synsets anywhere in the hierarchy. Even concepts such as ⟨entity⟩, which appear near the root of the hierarchy, have synsets containing words which may appear in the data. The synset for ⟨entity⟩ is {entity, something}, and the words entity and something can appear in the argument positions of verbs in the data.</Paragraph> </Section> <Section position="4" start_page="194" end_page="195" type="metho"> <SectionTitle> 3 Probability Estimation </SectionTitle> <Paragraph position="0"> The problem being addressed in this section is to estimate p(c|v, r), for c ∈ C, v ∈ V, and r ∈ R, where C is the set of concepts in WordNet's noun taxonomy. The probability p(c|v, r) is the probability that some noun in syn(c), when denoting concept c, appears in position r of verb v (given r and v). Using the relative clause example from the Introduction, the probabilities p(⟨dog⟩|run, subj) and p(⟨prize⟩|run, subj) can be compared to decide on the attachment site in Fred awarded a prize for the dog that ran the fastest.
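The machinery of Section 2 (synsets, the mapping cn(n) from nouns to candidate concepts, and the hyponym sets c̄ built from the direct-isa relation) can be sketched in Python on an invented four-concept fragment; the real taxonomy has around 66,000 concepts, and all names below are ours, not WordNet's API:

```python
# Toy fragment of a WordNet-style noun hierarchy (illustrative only).
direct_isa = {            # child concept -> set of parent concepts
    "chicken": {"meat"},
    "meat": {"food"},
    "food": {"entity"},
    "entity": set(),
}

synsets = {               # concept -> syn(c), its set of synonymous nouns
    "chicken": {"chicken"},
    "meat": {"meat"},
    "food": {"food", "nutrient"},
    "entity": {"entity", "something"},
}

def cn(noun):
    """cn(n) = { c | n in syn(c) }: all concepts the noun can denote."""
    return {c for c, syn in synsets.items() if noun in syn}

def hyponym_set(concept):
    """c-bar: the concept itself plus everything below it under isa,
    the transitive, reflexive closure of direct-isa."""
    result, stack = set(), [concept]
    while stack:
        c = stack.pop()
        if c not in result:
            result.add(c)
            # children are the concepts whose direct-isa parents include c
            stack.extend(child for child, parents in direct_isa.items()
                         if c in parents)
    return result
```

Note that `hyponym_set` works on a DAG as well as a tree, matching the observation that around one percent of nodes have more than one parent.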
We expect p(⟨dog⟩|run, subj) to be greater than p(⟨prize⟩|run, subj). Although the focus is on p(c|v, r), the techniques described here can be used to estimate other probabilities, such as p(c, r|v). (In fact, the latter probability is used in the PP-attachment experiments described in Section 5.) Using maximum likelihood to estimate p(c|v, r) is not viable because of the huge number of parameters involved: many combinations of c, v and r will not occur in the data. To reduce the number of parameters which need to be estimated, we utilise the fact that concepts can be grouped into classes, and represent c using a class c̄', for some hypernym c' of c. However, p(c̄'|v, r) cannot be used as an estimate of p(c|v, r), as p(c̄'|v, r) is given by the following: p(c̄'|v, r) = Σ_{c'' ∈ c̄'} p(c''|v, r).</Paragraph> <Paragraph position="2"> The probability p(c̄'|v, r) increases as c' moves up the hierarchy. For example, p(⟨food⟩|eat, obj) is not a good estimate of p(⟨chicken⟩|eat, obj). What can be done, though, is to condition on sets of concepts, and use the probability p(v|c̄', r). If it can be shown that p(v|c̄', r), for some hypernym c' of c, is a</Paragraph> <Paragraph position="3"> reasonable estimate of p(v|c, r), then we have a</Paragraph> <Paragraph position="4"> way of estimating p(c|v, r). To get p(v|c, r) from p(c|v, r), Bayes' rule is used: p(c|v, r) = p(v|c, r) p(c|r) / p(v|r). The probabilities p(c|r) and p(v|r) can be estimated using maximum likelihood estimates, as the conditioning event is likely to occur often enough for sparse data not to be a problem. (Alternatively, one could back off to p(c) and p(v) respectively, or use a linear combination of p(c|r) and p(c), and p(v|r) and p(v), respectively.) The formulae for these estimates will be given shortly.
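The Bayes inversion just described can be sketched as follows. The toy triples and helper names are ours, and for simplicity we pretend each noun token is already mapped to a single concept, whereas the paper's data is not sense disambiguated:

```python
# Toy co-occurrence triples (concept, verb, position).
triples = [
    ("dog", "run", "subj"), ("dog", "run", "subj"), ("dog", "bark", "subj"),
    ("prize", "run", "subj"), ("prize", "award", "obj"),
    ("dog", "award", "obj"),
]

def p_c_given_r(c, r):
    """MLE of p(c|r): fraction of position-r triples whose concept is c."""
    in_r = [t for t in triples if t[2] == r]
    return sum(t[0] == c for t in in_r) / len(in_r)

def p_v_given_r(v, r):
    """MLE of p(v|r): fraction of position-r triples whose verb is v."""
    in_r = [t for t in triples if t[2] == r]
    return sum(t[1] == v for t in in_r) / len(in_r)

def p_c_given_vr(c, v, r, p_v_given_cr):
    """Bayes inversion: p(c|v,r) = p(v|c,r) * p(c|r) / p(v|r).
    p_v_given_cr stands in for the class-smoothed estimate of p(v|c,r)
    obtained via the similarity-class of c."""
    return p_v_given_cr * p_c_given_r(c, r) / p_v_given_r(v, r)
```

The conditioning events c-and-r and v-and-r occur often enough for these relative-frequency estimates to be usable; only the p(v|c, r) factor needs the class-based smoothing described next.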
This only leaves p(v|c, r). The proposal is to estimate p(eat|⟨chicken⟩, obj) using p(eat|⟨food⟩, obj), or something similar. The following proposition shows that if p(v|c'', r) is the same for each c'' in c̄', where c' is some hypernym of c, then p(v|c̄', r) will be equal to p(v|c, r).</Paragraph> <Paragraph position="7"> So in order to estimate p(v|c, r), we need a way of searching for a set c̄', where c' is a hypernym of c, which consists of concepts c'' which have similar p(v|c'', r). Of course we cannot expect to find a set consisting of concepts which have identical p(v|c'', r), which the proposition strictly requires, but if the p(v|c'', r) are similar, then we can expect p(v|c̄', r) to be a reasonable estimate of p(v|c, r). We refer to the set c̄' as the 'similarity-class' of c, and the suitable hypernym, c', as top(c, v, r). The next section explains how we determine similarity-classes. The maximum likelihood estimates for the relevant probabilities are given in Table 1. 4</Paragraph> </Section> <Section position="5" start_page="195" end_page="196" type="metho"> <SectionTitle> 4 Finding Similarity-classes </SectionTitle> <Paragraph position="0"> First we explain how we determine if a set of concepts has similar p(v|c'', r) for each concept c'' in the set. Then we explain how we determine top(c, v, r).</Paragraph> <Paragraph position="1"> 4 Since we are assuming the data is not sense disambiguated, freq(c, v, r) cannot be obtained by simply counting senses. The standard approach, which is adopted here, is to estimate freq(c, v, r) by distributing the count for each noun n in syn(c) evenly among all senses of the noun.
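The count-splitting scheme in this footnote can be sketched as follows; the toy lexicon and function names are ours:

```python
from collections import defaultdict

# Noun -> candidate concepts cn(n) in a toy lexicon.  'snow' is ambiguous
# between the weather sense and the <cocaine> sense mentioned earlier.
cn = {
    "snow": {"snow_weather", "cocaine"},
    "chicken": {"chicken_meat", "chicken_bird"},
    "food": {"food"},
}

def estimate_freq(noun_triples):
    """Estimate freq(c, v, r) from untagged (n, v, r) triples by splitting
    each noun's count evenly among all senses c in cn(n)."""
    freq = defaultdict(float)
    for n, v, r in noun_triples:
        senses = cn[n]
        for c in senses:
            freq[(c, v, r)] += 1.0 / len(senses)
    return freq
```

For example, two occurrences of (snow, sell, obj) contribute a count of 1.0 to ⟨cocaine⟩ and 1.0 to the weather sense; as the surrounding text notes, this noise tends to dissipate as counts are accumulated up the hierarchy.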
Yarowsky (1992) and Resnik (1993) explain how the noise introduced by this technique tends to dissipate as counts are passed up the hierarchy.</Paragraph> <Paragraph position="2"> freq(c, v, r) is the number of (n, v, r) triples in the data in which n is being used to denote c.</Paragraph> <Paragraph position="4"> The method used for comparing the p(v|c'', r) for c'' in some set c̄' is based on the technique in Clark and Weir (1999) for finding homogeneous sets of concepts in the WordNet noun hierarchy. Rather than directly compare estimates of p(v|c'', r), which are likely to be unreliable, we consider the children of c', and use estimates based on counts which have accumulated over each child's hyponym set. This is an</Paragraph> <Paragraph position="6"> approximation, but if the p(v|c̄'i, r) are similar, then we assume that the p(v|c'', r) for c'' in c̄' are similar too.</Paragraph> <Paragraph position="7"> To determine whether the children of some hypernym c' have similar p(v|c̄'i, r), where c'i is the ith child, we apply a χ2 test to a contingency table of frequency counts. Table 2 shows some example frequencies for c' equal to ⟨nutriment⟩, in the object position of eat. The figures in brackets are the expected values, based on the marginal totals in the table. The null hypothesis of the test is that p(v|c̄'i, r) is the same for each i. For Table 2 the null hypothesis is that for every child, c'i, of ⟨nutriment⟩, the probability p(eat|c̄'i, obj) is the same.</Paragraph> <Paragraph position="8"> The log-likelihood χ2 statistic corresponding to Table 2 is 4.8. The log-likelihood χ2 statistic is used rather than Pearson's χ2 statistic because it is thought to be more appropriate when the counts in the contingency table are low (Dunning, 1993).
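A minimal sketch of the log-likelihood χ2 (G2) statistic on such a contingency table; the counts below are invented, not Table 2's actual figures:

```python
import math

def g_squared(table):
    """Log-likelihood chi-squared (G^2) statistic for a contingency table:
    rows = children c'_i of the candidate hypernym, columns = (count with
    the verb in the given position, count with any other verb).
    G^2 = 2 * sum over cells of O * ln(O / E), where the expected count E
    comes from the marginal totals, as in the bracketed figures of Table 2."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:          # O * ln(O/E) -> 0 as O -> 0
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2

# Hypothetical counts for three children of a candidate hypernym:
# column 0 = occurrences as object of the verb, column 1 = other verbs.
table = [[6, 94], [5, 95], [7, 93]]
```

With rows this similar the statistic stays well below the critical value, so the null hypothesis of homogeneous children would not be rejected.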
This tends to occur when the test is being applied to a set of concepts near the foot of the hierarchy.5 We compared 5 Fisher's exact test could be used for tables with low counts, but we do not do so because tables dominated by low counts are likely to have a high percentage of noise, due to the way counts for a noun are split among alternative senses. We rely on the log-likelihood χ2 test returning a non-significant result in these cases.</Paragraph> <Paragraph position="10"> the performance of log-likelihood χ2 and Pearson's χ2 using the PP-attachment experiment described in Section 5. It was found that the log-likelihood χ2 test did perform slightly better. For a significance level of 0.05 (which is the level used in the experiments), with 4 degrees of freedom, the critical value is 9.49 (Howell, 1997). Thus in this case, the null hypothesis would not be rejected.</Paragraph> <Paragraph position="11"> In order to determine top(c, v, r), we compare p(v|c̄'i, r) for the children of the hypernyms of c. Initially top(c, v, r) is assigned to be the concept c itself. Then, by working up the hierarchy, top(c, v, r) is reassigned to be successive hypernyms of c until the siblings of top(c, v, r) have significantly different probabilities. In cases where a concept has more than one parent, the parent is chosen which results in the lowest χ2 value, as this indicates the p(v|c̄', r) are more similar. The set given by top(c, v, r) together with its hyponyms is the similarity-class of c for verb v and position r.</Paragraph> <Paragraph position="12"> The next section provides evidence that the technique for choosing top(c, v, r), which we call the 'similarity-class' technique, does select an appropriate level of generalisation.</Paragraph> </Section> <Section position="6" start_page="196" end_page="199" type="metho"> <SectionTitle> 5 Experiments using PP-attachment ambiguity </SectionTitle> <Paragraph position="0"> The PP-attachment problem we address considers 4-tuples of the form (v, n1, pr, n2), and the problem is to decide whether the prepositional phrase pr n2 attaches to the verb v or the noun n1. For example, in the following case the problem is to decide whether from minister attaches to await or approval: await approval from minister. We chose the PP-attachment problem because PP-attachment is a pervasive form of ambiguity, and there exist standard training and test data, which makes for easy comparisons with other approaches. This problem has been tackled by a number of researchers: Brill and Resnik (1994), Ratnaparkhi et al. (1994), Collins (1995), and Zavrel and Daelemans (1997) all report results between 81% and 85%, with Stetina and Nagao (1997) reporting a result of 88%, which matches the human performance on this task reported by Ratnaparkhi et al. (1994).</Paragraph> <Paragraph position="1"> Although the PP-attachment problem has characteristics that make it suitable for evaluation, it presents a much bigger sparse data problem than would be expected in other problems such as relative clause attachment. The reason for this is that we need to consider how a concept is associated with combinations of predicates and prepositions.
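The upward search for top(c, v, r) might be sketched as follows; the toy hierarchy, the accumulated counts, and the fixed critical value are all invented for illustration (with two children per table there is 1 degree of freedom, for which the 0.05 critical value is 3.84):

```python
import math

def g_squared(table):
    """Log-likelihood chi-squared statistic for a contingency table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    return 2.0 * sum(o * math.log(o / (row_tot[i] * col_tot[j] / total))
                     for i, r in enumerate(table)
                     for j, o in enumerate(r) if o > 0)

# Toy hierarchy, plus accumulated counts [with the verb, with other verbs]
# over each child's hyponym set (hypothetical figures, not the paper's).
parent_of = {"chicken": "meat", "meat": "food", "food": "root"}
children_counts = {
    "meat": {"chicken": [8, 92], "pork": [7, 93]},
    "food": {"meat": [15, 185], "fruit": [2, 198]},
}

def top_class(c, critical):
    """Walk up from c, reassigning top(c, v, r) to successive hypernyms
    until the children of the next hypernym have significantly different
    probabilities (G^2 at or above the critical value)."""
    top = c
    while top in parent_of and parent_of[top] in children_counts:
        parent = parent_of[top]
        table = list(children_counts[parent].values())
        if g_squared(table) >= critical:
            break          # siblings differ significantly: stop here
        top = parent       # homogeneous children: generalise one level up
    return top
```

A multiple-parent node would additionally compare the G2 values of the candidate parents and follow the lowest, as described above.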
The approach described here uses probabilities of the form p(c, pr|v) and p(c, pr|n1), where c ∈ cn(n2). This means that for many predicate/preposition combinations which occur infrequently in the data, there are few examples of n2 which can be used for populating WordNet in these cases. Despite this, we were still able to carry out an evaluation by considering subsets of the test data for which the relevant predicate/preposition combinations did occur frequently in the training data.</Paragraph> <Paragraph position="2"> We decide on the attachment site by comparing p(cv, pr|v) and p(cn1, pr|n1), where cv = arg max_c p(c, pr|v) and cn1 = arg max_c p(c, pr|n1). The sense of n2 is chosen which maximises the relevant probability in each potential attachment case. If p(cv, pr|v) is greater than p(cn1, pr|n1), the attachment is made to v, otherwise to n1. If n2 is not in WordNet we compare p(pr|v) and p(pr|n1). Probabilities of the form p(c, pr|v) and p(c, pr|n1) are used rather than p(c|v, pr) and p(c|n1, pr), because the association between the preposition and v and n1 contains useful information. In fact, for a lot of cases this information alone can be used to decide on the correct attachment site. The original corpus-based method of Hindle and Rooth (1993) used exactly this information. Thus the method described here can be thought of as Hindle and Rooth's method with additional class-based information about n2.</Paragraph> <Paragraph position="3"> In order to estimate p(cv, pr|v) (and p(cn1, pr|n1)) we apply the same procedure as described in Section 3, first rewriting the probability using Bayes' rule: p(cv, pr|v) = p(v|cv, pr) p(cv, pr) / p(v) = p(v|cv, pr) p(pr|cv) p(cv) / p(v).
The probabilities p(cv) and p(v) can be estimated using maximum likelihood estimates, and p(v|cv, pr) and p(pr|cv) can be estimated using maximum likelihood estimates of p(v|top(cv, v, pr), pr) and p(pr|top(cv, pr)) respectively. 6 We used the training and test data described in Ratnaparkhi et al. (1994), which was taken from the Penn Treebank and has now become the standard data set for this task. The data set consists of tuples of the form (v, n1, pr, n2), together with the attachment site for each tuple. There is also a development set, to prevent implicit training on the test set during development. We extracted (v, pr, n2) and (n1, pr, n2) 6 In Section 4 we only gave the procedure for determining top(cv, v, pr), but top(cv, pr) can be determined in an analogous fashion.</Paragraph> <Paragraph position="4"> triples from the training set, and in order to increase the number of training triples, we also extracted triples from unambiguous cases of attachment in the Penn Treebank. We preprocessed the training and test data by lemmatising the words, replacing numerical amounts with the word 'definite_quantity', replacing monetary amounts with the words 'sum_of_money', etc. We then ignored those triples in the resulting training set (but not the test set) for which n2 was not in WordNet, which left a total of 66,881 triples of training data. The test set contains 3,097 examples.</Paragraph> <Paragraph position="5"> Table 3 gives some examples of the extent to which the similarity-class technique is generalising, using the training data just described, and a significance level of 0.05.</Paragraph> <Paragraph position="6"> The chosen hypernym is shown in upper case.
Note that the WordNet hierarchy consists of nine separate sub-hierarchies, headed by such concepts as ⟨entity⟩, ⟨abstraction⟩, and ⟨psychological_feature⟩, but we assume the existence of a single root which dominates each of the sub-hierarchies, which is referred to as ⟨root⟩. In cases where WordNet is very sparsely populated, it is preferable to go to ⟨root⟩, rather than stay at the root of one of the sub-hierarchies, where the data may be noisy or too sparse to be of any use. The table shows that, with the amount of data available from the Treebank, the similarity-class technique is selecting a level at or close to ⟨root⟩ in many cases.</Paragraph> <Paragraph position="7"> We compared the similarity-class technique with fixing the level of generalisation. Two fixed levels were used: the root of the entire hierarchy (⟨root⟩), and the set consisting of the roots of each of the 9 sub-hierarchies. The procedure which always selects ⟨root⟩ ignores any information about n2, and is equivalent to comparing p(pr|v) and p(pr|n1), which is the Hindle and Rooth approach. The results on the 3,097 test cases are shown in Table 4. We used a significance level α of 0.05 for the χ2 test.7 As the table shows, the disambiguation accuracy is below the state of the art. However, the results are comparable with those of Li and 7 Similar results were obtained using alternative levels of significance. Rather than simply selecting a value for α, such as 0.05, α can be treated as a parameter of the model, whose optimum value can be obtained by running the disambiguation method on some held-out supervised data.</Paragraph>
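The attachment decision procedure, including the fall-back to the Hindle-and-Rooth-style comparison when n2 is not in WordNet, can be sketched as follows; the probability tables below are invented stand-ins for the class-based estimates described above:

```python
# Toy probability tables (hypothetical values, not estimated from data).
p_c_pr_given_v = {   # (concept, preposition, verb) -> p(c, pr | v)
    ("minister", "from", "await"): 0.002,
}
p_c_pr_given_n1 = {  # (concept, preposition, n1) -> p(c, pr | n1)
    ("minister", "from", "approval"): 0.010,
}
p_pr_given_v = {("from", "await"): 0.05, ("to", "send"): 0.30}
p_pr_given_n1 = {("from", "approval"): 0.20, ("to", "letter"): 0.10}

def attach(v, n1, pr, senses_n2):
    """Return 'verb' or 'noun'.  senses_n2 is cn(n2); it may be empty when
    n2 is not in WordNet, in which case we back off to comparing p(pr|v)
    and p(pr|n1), as in Hindle and Rooth's method."""
    if senses_n2:
        # pick the sense of n2 maximising each attachment probability
        best_v = max(p_c_pr_given_v.get((c, pr, v), 0.0) for c in senses_n2)
        best_n1 = max(p_c_pr_given_n1.get((c, pr, n1), 0.0) for c in senses_n2)
    else:
        best_v = p_pr_given_v.get((pr, v), 0.0)
        best_n1 = p_pr_given_n1.get((pr, n1), 0.0)
    return "verb" if best_v > best_n1 else "noun"
```

With the figures above, "await approval from minister" attaches the PP to the noun, the reading the paper treats as correct.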
The rca.son for this is tha.t when the technique wa,s being used to estima.te \])( vlc,,, \])r ) a.Hd P(?~.:I \[c.,zl, I)?'), in many cases tile root o1&quot; 1lie hiera.rchy wa.s being chosen as the apl>rOl)riat;e level of genera.lisa.tion, due to a. sparsely popula.ted WordNet in tha.t insta.nce. Recall that this is la.rgely due to tit<', fa.ct that we a.rc a.ttemltting to popula.te WordNet fbr comltina.tions of predic~tes ~md prepositions. In such cases tile sinlil~u'ity-elass technique is not helping because there is very little or no informa.tion a.1)otlt ~,2. s aln an effort to obtahl more do.to, we a, pplicd the extraction heuristic of lla.tna.parkhi (1998) to \Y=all Street Journa.l text, which increased the nuntl)er of training triples by ~L factor of 111. '\['his only a.chievcd comparable results, however, presumably boca.use the high volume of noise in the dat~ outweighs the benefit of the increase in da.ta size. \]{.atnaparkhi reports only 69% a.ccuracy tot In order to eva.lua.te the similarity-class technique further, we took those test cases for which tile root wa, s not being selected when estima.ting bet:t, J,(,,I,,~. J,') .+,,d \])('/,.1 I~,,. pv). n:\],is .pplied to 113 c~ses. The results ~u;e given in Table 5.</Paragraph> <Paragraph position="10"> We a.lso took those test cases for which the root was I)eing selected when estimating +~t most one of p(v\[c+,,pr) a.nd p(,q \[c,~, pr). This a.pplie, d to \]032 test ca.sos. The results a.re shown in %> ble 6.</Paragraph> <Paragraph position="11"> the extraction heuristic when applied to the \]%nn Treeba.nk (excluding cases where the ln:eposition is of).</Paragraph> </Section> class="xml-element"></Paper>