A Class-based Probabilistic approach to Structural 
Disambiguation 
Stephen Clark and David Weir 
School of Cognitive and Computing Sciences 
University of Sussex 
Brighton, BN1 9IIQ, UK 
{ st ephec:l_, david~r}Ocogs, susx. ac. uk 
Abstract 
Knowledge of which words are able to fill p~rtic- 
ular argum.ent slots of a predicate can be used 
tbr structural disambiguation. This paper de- 
scribes a proposal :for acquiring such knowledge, 
and in line with much of the recent work in this 
area, a probabilistic approach is taken. We de- 
velop a novel way of using a semantic hierar- 
chy to estimate the probabilities, and demon- 
strate the general approach using a preposi- 
tional phrase atta.chment experiment. 
1 Introduction 
Knowledge of which words are able to fill 
particular a.rgument slots of a. l?redlca.te ca,n 
be used tbr structural disa.mbiguation. In 
the following example (Charnial~, 1993), the 
fact that dog, rather than prize, is often 
the su.1)ject of r'lm, can t)e used to decide 
on the attachment site of the relative clause: 
Fred awarded a prize for the dog that ran the fastest 
We describe a proposal for acquiring such 
knowledge, and as in other recent work in this 
area (Resnik, 1993; l,i and Abe, t998), a prob- 
abilistic approach is taken. Using probabilities 
accords with the intuition that there are no ab- 
solute constraints on the arguments of predi- 
cates, bu.t rather that constraints are satisfied 
to a certain degree (Resnik, 1993). Unfortu- 
nately, defining probabilities in terms of words 
leads to a model with a vast number of param- 
eters, resulting in a sparse data problem. To 
overcome this, we propose to define a probabil- 
ity model in terms of senses from a semantic hi- 
erarchy, exploiting the fact that senses of nouns 
can be grouped together into semantically sim- 
ilar classes. 
We use the semantic hierarchy of noun senses 
in WordNet (Fellbamn, 1.998), which consists of 
qexicalised concepts' related by the qs-a-kind- 
of' relation. If c' is a kind of c, then c is a hy- 
pcrnym of c', and c' a hyponym of c. Counts are 
passed u.p the hierarchy fl'om the senses of nouns 
appearing in the data. Thus if cat chicl~cn a.p- 
pears in the data, th.e count for this item passes 
u\]) to (meat}, (good}, and all the other hyper- 
nyms of that sense of chicken. 1 In. order to es- 
timate the probability that a sense of chichcn 
~tppea.rs as the object of the verb cat, we repre- 
sent (chicken} using a. suitable hypern3qn , such 
as (:eood), and base our probability estimate on 
that instead. The level at which (chicken) is 
represented is cruciah it should be high enough 
for adequate counts to have accumulated, but 
not too high so that the hypernym is no longer 
representative of (chicken}. An exanlple of a 
hypernym whidl would be too high is (erttity), 
as not all entities are semantically similar with 
respect to the object position ot7 cat. 
The problem of choosing an appropria.te level 
in the h.ierarchy at which to represent a par- 
ticular noun sense (given a predicate and argu- 
ment position) has been investigated by Resnik 
(1993), Li and Abe (1998) and ll,iba,s (1995). 
The learning mechanism presented \]lore is a 
novel approach based on tinding semantically 
similar sets of concepts in a hierarchy. We 
demonstrate the effectiveness of our approach 
using a PP-attachment experiment. 
2 The Input Data and Semantic 
Hierarchy 
The data used to estimate the probal)ilities is 
a multiset of 'co-occurrence triples': a noun 
IWe use italics when referring to words, ~Lnd angled 
bra.ckets for concepts. This notation does not alwa.ys 
pick out a concept uniquely, but the context should make 
cle~n: the concept being referred to. 
194 
\]enltna., verb len, ina, and argunien/, i)osition. 2 
l,et the li;ilivc.rso of verbs~ arglllilent posi- 
tions aud ,lOtlBS tha.t call appear in the in- 
put data. be denoted "l) = { "Vl~... ~ 'vkv }~ "\]~., : 
{ ,,.,,..., ,.,,~ } a,d N = { ',~,,..., ",J,H }, ~'esi,e,- 
tiw:ly. Sucll data c~ui I)e obtMned fro,,, a. tree- 
ban,s, or from a. shallow pa:rser. Note tha.t we 
do lie, distinguish I)etwee, i a.lternative seiises of 
ve,'\])s~ a,lld assUIfle tha.t each \]llsta.llCe O\[ a. no,ill 
i,I tile data, refel?s to exactly olle conc, ept. 
The sei-i-la.ii{ic hiel:a.rchy used is the tie,Ill hy- 
\[)efltylIl ta.xo:nonly of \/Vo,'dNet (vel'SiOll \]..6). :3 
l,e~ (7,' = { el,..., Chc } be tile sot; of concepts 
in WordNet (lq: ,-~ 66,000). A concel,t is rel)re- 
sented in \¥ordNet 1,y a synset: a sel o1' syilolly- 
,nous words which cat, I)<7: used to (lenotc lha.1 
c.oncei)t. I!kil7 exa.nll)iO ~ the COliC;el)l, ~co('.a.iile~ 
a.s iii the (\[rtlg~ iS represented l)y l.he following 
synset: {cocaine, co(-ai't~, coD(;, .,no'w, C}. l,et 
syn(c) C ;V l,e the syllscl; I'or the (:olicel)t c, 
a.d let c,,(,,.) - { c I"" m sy,,(+:) } I,e the. set of 
concepts that (;a, ii be denoted by the llO,lli 17.. 
The \]liera.rc\] U has the stl'llCtlll'e O\[ a directed 
acyciic gi'a.l,\]i : althougli the nunll)er of nodes in 
the gra.ph wil;h lllO,'e 1,ha.l/ olle \])al'elll. is only 
a,rOtllid cite i)er(tent of the total. The edges ill 
the graph \['orni what we call the dii'ecl.-isa rela-- 
Lion (dire('.t-is;,. C C' x {'). l,et isa = dii'ect-isa. X 
be ,lie tI'a, llSitivc~ re\[lexive (;\]OStlre Of (lirect-isa, 
so t\]iat (c/, c) ~ \]sa :=> c is a \]iy\])ernynl o;\[' (/; a.nd 
let ?~ = { c' \[(c',c) misa. } I)e the set consisting 
of the concept c and all of its hyponynis. Thus, 
the set (*ood) conta.ins all the concel)ts wtiich 
:q,re kinds of food, inclllditig (food). 
Note tha.t words in the data can al)pear in 
SyllSCts a, liy\vh01'0 ill the hiel;archy. Even COll- 
cel)ts Sllch a.s (entity)~ which a,l)pe.ar 1,ear t\]ie 
l:OOt el" the hierarchy, have synsets containing 
words which may a.pl)ear hi the da.ta. The 
synset for (entity) is {c'ntitg, something}, and 
the words cntit;q a.nd something (';/,ll apl)ear in 
the a.rgulnent positions of verbs in the data. 
3 Probability Estimation 
The problem being a.ddressed in this section is 
to estimate p(civ, r), for c C C, v < P, and 
2Only verbs a.re considered here, but this work applies 
to other predicates which take a.rgunmnts that can bc 
orga.nised into a semantic hierarchy. 
3When wc refer (.o concepts ill \'VoMNct, wc nlea.n 
concepts ill WordNet's nomi ta, xonomy. 
r C 'R.. The I~roba.bility p(eiv ,r) is the 1)rob- 
ability tha.t some lie\ill in syn(c)~ when dellOf 
itlg coneet)t c, appears in position 'r of" verl) 'v 
(given r a.nd v). Using the relative clause ex- 
anlpie fl'Otil tile hi,reduction> the p,'obal)ilities 
p((dog}lru,z,subj ) a, nd p((prize)lrv.',z,sub.i ) ca.n 
be toni_pared to decide on the attachluent site 
fl i,i I red awa'rded a p?'izc Jot iitc dog t/tat ran 
~l,; f,.~:~;.,.:. We expe~t 1'((dog) l""', ~m,.i) to 1,e 
grea.ter than p((pr±ze) l'l'zt'~z,subj). Althougt, the 
\['OCllS iS O11 7,(c1~,,,), the tochniqlies described 
here can be used to estimat, e other lm)babilities, 
such a.s p(c, fly ). (in fa.ct, the latter prol)alfii- 
il, y is used hi the Pl>-a£ta.clinient CXl)erhnents 
de.qcril)ed in Section 5.) 
Using n, axinlun/ likelihood to cstinia.tc 
\])(C\['V, '/')iS iiot via.hie 1)eca.use of the liugc nuill- 
t)er of I)al'al,l or.ors i\]i vo\] red. ~/lally COl IIt,i II a.tiolls 
O\[ C, 'l) a.lld 7' will /lot OCCIII" in the data. '1'o re- 
duce the ntii,il)ei' of i)a.ranletcrs which need to 
\])e estima.ted~ we utilise tile fa.ct tha.t COllCepts 
Call be grouped ini;o cla.sses, a.nd \]'epresent C IlS- 
ing a class (/, for some hypernynl c' of c. Ilow- 
ever, p(c'lv , r)ca.nnot be used as a.n estiniate of 
v((l,,, ,'), as V((:I"', "') is give. l,y the foliowi,,g: 
E .( c'l,,, .,.) = v((:'l., ,) <:"C~ 
The probal)ility ~)(("l'~") i,,c,'eases as c' 
moves up the hiera.rchy. For example, 
s)((eooa)l,;aZ,oi,.i ) is ,,or a good estiu,a,te of 
1,((chicken)leat,obj ). What can be done 
though, is to condition on sets of concepts, and 
use the probability p(v\[c', r). If it ca:n be shown 
tha.t p(v\[c', r), for some hypernym c' of c, is a. 
reasoliable estilnate o1' v(vlc, v), then we have a. 
wa.y of estiniati,ig p(clv, r). To get \])(vie; , r)from 
l,(dv, ,') i~ayes ,',ie is ,,sed: 
p(4*,,,') p(vl~, "'v(cl'') = 7)~ 
The prol)abilities p(clr ) and p(v\[r) cm~ be esti- 
mated using maximum likelihood esti,n~tes, a.s 
the conditioning event is likely to occur often 
enough for sp;tl'se data not to be a problem. 
(Alternatively ()tie could ha.ok-off to p(c) and 
ply) respectively, or use a. linear combhia.tion 
of p(d',)a.,d \],(c), ,.,d P0,1v)a,d V('0, ,'espoc- 
tively.) The formula.e for these estimates will 
I,e give,, shortly. This only leaves plY\[c, r). The 
195 
proposM is to estilnate P(eatl(~h±cken>, oh j) us- 
ing p(eat\](food), oh j), or something similar. The 
following proposition shows that if p(vlc" , r) is 
I 
the same for each c" in c', where c' is some 
hypernym of c, then p(v\]c', r) will be equal to 
p(v\[c, r): 
-- I \],(~1c",7.) 
:/~ for all c" ~ c' ~ \],(vlc',7.)= 
The proof is as follows: 
,\],(~17") \],(vl7,7") = \],(71~,,)~ 
- ~'(L 17") ~ p(c"lv,7,) 
z,(c'l,') - clinG 1 
_ z,(~l~') V" \],(~ d', 7.~ \]'(c'17) 
z,(c~lT,) ~ ' ' p(~lT) 
_ 2 a, ~ p(c"l,0 p(c'lT,) _ 
Cling I 
= /~ 
So in order to estimate p(v\[c,r), we need a 
way of searching for a set c', where c' is a hy- 
pernym of c, which consists of concepts c" which 
have similar p(v\]c", r). Of conrse we cannot ex- 
pect to find a set consisting of concepts which 
have identical p(vlc", r), which the proposition 
strictly requires, but if the p(vlc" , 7") are simila.r, 
then we can expect p(vld , r) to be a. reasonable 
estimate of p(vlc , 7"). We refer to the set c' as 
the %imilarity-class' of c, and the suitable hy- 
pernym, c l, as top(c, v, r). The next section ex- 
plains how we determine similarity classes. The 
maxim.urn likelihood estimates for the relevant 
probabilities m:e given in Ta.ble 1.4 
4 Finding Similarity-classes 
First we explain how we determine if a set of 
concepts has similar p(vlc", r) for each concept 
c" in the set. Then we explain how we determine 
top(c, v, r). 
4Since we are a.ssuming the data. is not sense dis- 
a.mbiguated, f,:eq(c, v, r) cannot be obtained by sim- 
ply counting senses. The standard approach, which is 
adopted here, is to estimate fl'eq(c, v, r) by distributing 
the count tor each noun n in syn(c) evenly among all 
senses of the noun. Yarowsky (1992) and \]{esnik (1993) 
explain how the noise introduced by this technique tends 
to dissipate as counts are passed up the hierarchy. 
Table 1: Maximum Likelihood Esti:lnates 
freq(c, v, r) is the number of (n, v, r) triples in 
the data in which n is being used to denote c. 
fl'eq(c,r) Ev'EV freq(c,v',r) 
P(CI?') : "frcq(r) -- Ev'EVEdccfrcq(c',v',r) 
freq(v,r) Ec'Ec freq(ct,v,r) 
/)(VlT")- freq(r) : Zv,EVZc, ccfreq(c',v',r) 
\]}(vie w, 7") -- freq(c-i"v'r) Z~"c~77freq(c't'v'r) 
rreq(d,,-) = Ev,evE~,,~Tf,-eq(~",'~,',,-) 
Tile method used for comparing the p(vlc" , r) 
for c" in some set c', is based on the technique 
ill Clark and Weir (1999) used for tinding homo- 
geneous sets of concepts in the WordNet noun 
hierarchy. Rather than directly compare esti- 
mates ofp(vlc" , r), which are likely to be unreli- 
able, we consider the children of c', and use esti- 
mates based on counts which have accumulated 
I ,/ ,/ at the children. If c' has children Q,%,..., c,,,,, 
I we compare e'(~l<, ") for each i. Th~s is an 
I 
a.pproximation, but if the p(vlc}, r) arc similar, 
I then we assume that the p(vlc" ,r) for c" in c' 
are similar too. 
To deterlnine whether the children of some 
• . , ./is the hyperny,~ c' have simila,' \]'('~'14) where c~ 
ith child, we apply a X 2 test to a contingency 
tM)le of frequency counts. Table 2 shows some 
e×a.mple frequencies for c' equM to (nutriment), 
in the ol)ject position of cat. The figures in 
brackets are the expected values, based on the 
marginal totMs in the table. The null hypoth- 
esis of the test is that p(vl@ r)is the same for 
each i. libr TM)1e 2 the null hypothesis is tlmt 
,I tbr every child, ci, of (nutr±ment}, the probabil- 
ity p(catlc~, obj) is the same. 
The log-likelihood X 2 statistic corresponding 
to TM)le 2 is 4.8. The log-likelihood X 2 statistic 
is used rather than the Pearson's X 2 statistic 
because it is thought to be more appropriate 
when the counts in the contingency table are 
low (\])unning, 1993). This tends to occur when 
the test is being applied to a set of concepts 
near the foot of the hierarchy, s We compared 
5Fisher,s exa.ct test could be used for tables with low 
counts, but we do not do so because tables dolninated 
by low counts are likely to have a. high percentage of 
noise, due to the way counts for a noun are split ~unong 
196 
ridable 2: Contingency tal)le for children of (nutriment) 
ci 
milk) 
<meal) 
(course) 
(d±s~) 
(del±cacy) 
fr\[N(~, cat, oh.i) 
o.o (o.(~) 
J.a (l.r) 
s.a (s.r) 
o.a (~ .s) 
\] 5.d 
r,.~,q(~, oh.i)- 
l't'(x I(~, cal, oh j) 
9.0 (s..,~) 
rs.o (so..o) 
24.r (24.a) 
s2.a (s~ ..0) 
2r.4 (~s.9) 
221.4 
r,4q(~, oh.i) = 
E,,~v r,.~,q(W, .,,, oh.i) 
9.0 
86.5 
26.0 
87.6 
27.7 
the l)erformance of log-likelihood X 2 and Pear- 
son's X ~2 using the l>P-~tttaehment experhnent 
described in Section 5. It was found that the 
log-likelihood ~2 test; did perform slightly bet- 
t('r. \]"or a signitic~nce lew;I ot' 0.05 (which is the 
level used in the exl)eriments), with 4 degrees 
of freedom, the critical wdue is 1,1.86 (llowell, 
;1!197). Thus in this ca.se, tlle null hyl~othesis 
would not be rejected. 
In order to determine top(c, v, r), we conlpare 
l,(vl~7, v) re,: the children of the hypernyms of 
c. hlitially top(c, 'v, r) ix assigned to I)e the con- 
eet)t c itself. Then, l>y worldng Ull the hierarrclly, 
top((:, 'V, r) is reassigned to I)(' successive hyl)er- 
nyms of c until the siblings of tol)(C , ~7+ 7')have 
siglfifi(:a.ntly different prol)abilities. In cases 
where a. concept has more than one I)a.J'ent, the 
parent is chosen which results in tile lowest :\~2 
wflue as this indicates the p(v\[U,r) are more 
simila.r. The set top(c,v,r) is the sinfi\]a.rity- 
cla.ss of c t'or verb v and position r. 
Th(; next section provides evidence that tile 
technique for choosing lOl)(C , v, r), which we call 
the 'simihu'ity-class' technique, does select an 
appropriate level of generalisation. 
5 Experiments using PP-attachment 
ambiguity 
The l>P-atta.chme:nt problem we address con- 
siders 4-tuples of the form v,:,t,,pr, n2, and 
the l)robleln is to decide wllether tile prel)o- 
sitional phrase pr n2 attaches to the verl> v 
or the 71oun nl. For exatnl)le, in the fol- 
lowing cas(; tim l)rol)lent is to decide whether 
alternative senses. YVe rely on the log-likelihood X ,2 test 
returning a, non-significant result in these cases. 
J)'om minister attaches to awaii or approvah 
a.wait apt)7'owd from minister 
We chose the l~P-attachn~ent l)roblenl beca.use 
P l>-attaehment is a perw,.sive form of ambiguity, 
and there exist sta.ndard training and text da.ta~ 
which ma.kes for easy comparisons with other 
a.pproache~s. This p7'oblenl has been tackled by a 
nu nlber of resea.rehers, lh'ill and Resnik (1994), 
Ratnal)arkhi et al. (\]994), Collins (1995), Za- 
w:el and l)aelemans (\] 997) all report results be- 
tween 81% and 85%, with Stetina. and Nagao 
(\] 997) tel)erring a result of 88%, which matches 
lhe hunm,t+ l>erf'ornlan(;e on this task rel)orted by 
Ratnal>arkhi (% al. (199.'1). 
Althougll th(' l)l)-attachnwnt l)roblem has 
chara('teristics that n,a.ke it suita.ble for ('valua.- 
t;ion, it; I)resents a inuch bigger sparse data. t)\]:ol)- 
le, m tlla.n would 1)e exl)ected in other l)roblems 
such as relative (:lausc atSadlment. The reason 
for this is that we need 1;(7 cot,sider how ~l C()l~ - 
Cel)t is associated with combi~zations of predi- 
cates and prel)ositions. T\]le al)proach described 
11(;7"(; uses prolml)ilities of the Ibrnl p(c, prlv ) 
,u,d ~,,(c.z,,l,,.,), who,;o ,~ ~ ~,l(,+~). This .lea,is 
that for many predicate/prel)osition combina- 
tions which occur infl'equently in the d~ta., there 
are few examples of n2 which ca.n be used lot 
populating Wo7'dNet in these cases. Despite 
this, we were still able to carry out an ewl.lu- 
ation by considering subsets of the test (ta.ta for 
which the relewmt predicate~preposition com- 
I)inations did occur frequently in tit(; training 
d at a,. 
We deckle on tile a.tta('hnmnt site by compar- 
197 
ing p(c~, pr\[v) and p(c,~,, p,'\],q), where 
= a rg n ax l,(c,p,'lv) 
c,z 1 = arg max p(c, prlTq ) 
The sense of n2 is chosen which maximises 
the relevant probability in each potential at- 
tachment case. If p(c,,,p,jv)is greater than 1)(%, :m'l~l), 
the attachment is made to v, oth- 
erwise to nl. If n2 is not in WordNet we com- 
pare p(prlv ) and p(prl~t~). Probabilities of the 
form p(c, prlv ) and p(c, prl~tl ) are used rather 
than p(clv,pr ) and p(cl~l,p,j, because the as- 
sociation between the preposition and v and ~q 
contains useful information. In fact, for a lot 
of cases this intbrmation alone can be used to 
decide on the correct attachment site_ The orig- 
inal corpus-based method of \]Jindle and ll.ooth 
(1993) used exactly this information. Thus the 
method described here can be thought of as Hin- 
dle and Rooth's method with additional class- 
based information about n2. 
In order to estimate p(c,,,prlv)(and 
p(C,~l,ln'l,,,,)) we apply the same procedure 
as described in Section 3, first rewriting the 
probability using Bayes' rule: 
p,,.)p(c,,, p,.) p(c,,,j,,lv)-- p(vlcv, v(v) 
p,.) !'(P'q c,, ) 
: p(dc,,, l,(v) 
The probabilities p(c.~) and p(v) can be es- 
timated using maximum likelihood estimates, 
a.nd p(vlcv, p,' ) and j,(p,'lc,) can be esti- 
m.ated using maximum likelihood estimates of 
p(vltop(c~ ,v,p,'),pr) and p(prltop(%,pr)) re- 
spectively. 6 
We used the training and test data described 
in l/.atn.aparkhi et al. (1994:), which, was taken 
Doln the Penn %:eebank and has now become 
the standard data set for this task. The data 
set consists of tuples of the form (v, ~zl, p~', n2), 
together with the attachment site for each tu- 
ple. There is also a development set to prevent 
implicit training on the test set during develop- 
ment. \~e extracted (v, pr, '~2) and (hi, pr, ,z2) 
~ln Section 4 we only gave the procedure for deter- 
mining top(c~, v, pr), but top(c~, pr) can be determined 
in an analogous fashion. 
triples from the training set, and in order to in- 
crease the number of training triples, we also 
extracted triples Kern unambiguous cases of at- 
tachlnent in the Penn %'eebank. We prepro- 
cessed the training and test data by \]emmatising 
the words, replacing numerical amounts with 
the words ~definite_quantity', replacing mone- 
tary amounts with the words 'sum_olLmoney' 
etc. We then ignored those triples in the re- 
sulting training set (but not test set) for which 
7z2 was not in WordNet, which left a total of 
66,881 triples of training data.. The test set 
contains 3,097 examples. 
Table 3 gives seine examples of the ex- 
tent to which the similarity-class technique 
is generalising, using the training data just 
described, and a significance level of 0.05. 
The chosen hypernym is shown in Ul)per 
case. Note that the WordNet hierarchy con- 
sists of nine separate sub-hierarchies, headed 
by such concepts as (entity>, (abstraction), 
(psychological~eature), bnt we assume the ex- 
istence of a single root which dominates each 
of the sub-hierarchies, which is referred to as (root>. 
In cases where WordNet is very sparsely 
populated, it is preferable to go to (root), 
rather than stay at the root of one of the sub- 
hierarchies where the data may be noisy or too 
sparse to be o\[' any use. The table shows that 
with the amount of data ava.ilable from the Tree- 
bank, the similarity-class technique is selecting 
a. level at or close to (root> in many cases. 
We compared the similarity-class technique 
with fixing the level of generalisation. Two tixed 
levels were used: the root of the entire hieraJ'- 
chy ((root>), and the set consisting of the roots 
of each of the 9 sul>hierarchies. The procedure 
which always selects (root} ignores any informa- 
tion about ~z2, and is equivalent to comparing 
p(prlv ) and p(prl,h), which is the ltindle and 
Rooth approach. The results on the 3,097 test 
cases are shown in Table 4. We used a. signifi- 
cance level a of 0.05 tbr the X 2 test. r 
As the table shows, the disambiguation ac- 
curacy is below the state of the art. However, 
the results are comparable with those of l,i and 
rSimilar results were obtained using alternative levels 
of signifiea.nce. Rather than simply selecting a value for 
a, such as 0.05, a' can be tree,ted as a parameter of the 
model, whose optimum value caJl be obtained by running 
the disambiguation method on some held-out supervised 
data. 
198 
'l'al)le 3: Ilow the simila.rity-cla.ss technique chooses top(c, v, pr)a.lld top(c, nq, pr) 
(?Zl, \])?', C) I Iypernyms of c 
( bid,for,(company) ) 
( ~i~io,,,,i,,,,<c~sh> ) 
(v, l", c) 
('l,ol, i,J!q, O J; <t tans act ion>) 
( clo.5",8, (t\[,, <def init e_quant ity>) 
( ,,~, wiU,,,<oeeioia~> ) 
< company> <establ ishment> (or ganisat ion)<social _group> (GROUP) <root> 
(risk) <venture> (task)(+york> (activity)<act> (ROOT> 
( cash>( curt ency)(monetary mystem) (asset)(POSSESS I ON)(root ) 
<transact i on)<group_act ion) <act >(ROOT) 
<D RF II~ I TE_QUAN= TY> <mea~ur e> <abst tact ion> <root> 
<o~i~ial><adjudicator><perso~><li~e~orm><CAUSA~,~G~,NT><e~tity><root> 
'l'able <1: (,o~)lete test set :~()97 test cases 
(~eneralisation technique % co:red. 
SiJnila.rity-cla.ss 80.3 
Select root of sub-hiera.rchy 77.9 
Alwa,ys select (root> 79.0 
Table 5: (root> 1)eing selected for 1)oth a.ttach- 
nlent 1)oints \] 713 test cases 
(hmeralisa+tion technique 
Simila.rity-cla.ss 
Select root of sub-hierarchy 
Ahva.ys select (root> 
% tort'cot 
90.3 
811.4 
79.6 
Abe (119!)8) who a.dol)t a similar a.l>proa(:h us- 
i:ng \VorclNet, but with a, differ<rot training and 
test set. I,i a.nd Abe iml>rOVed on the l\[\]n- 
die and Rooth techni(lue l)y 1.5%, whh;h is i, 
line with our results. As a.n evahla.tion of the 
simibu'ity-class tec\]lnique, the result is incon- 
clusive. The rca.son for this is tha.t when the 
technique wa,s being used to estima.te \])( vlc,,, \])r ) 
a.Hd P(?~.:I \[c.,zl, I)?'), in many cases tile root o1" 1lie 
hiera.rchy wa.s being chosen as the apl>rOl)riat;e 
level of genera.lisa.tion, due to a. sparsely popu- 
la.ted WordNet in tha.t insta.nce. Recall that this 
is la.rgely due to tit<', fa.ct that we a.rc a.ttemltt- 
ing to popula.te WordNet fbr comltina.tions of 
predic~tes ~md prepositions. In such cases tile 
sinlil~u'ity-elass technique is not helping because 
there is very little or no informa.tion a.1)otlt ~,2. s 
aln an effort to obtahl more do.to, we a, pplicd the ex- 
traction heuristic of lla.tna.parkhi (1998) to \¥all Street 
Journa.l text, which increased the nuntl)er of training 
triples by ~L factor of 111. '\['his only a.chievcd comparable 
results, however, presumably boca.use the high volume 
of noise in the dat~ outweighs the benefit of the increase 
in da.ta size. \]{.atnaparkhi reports only 69% a.ccuracy tot 
Table 6: (root> being select(,(I for at most one 
o1' ill(, a.tl.a('hnmnt points 1032 i.cst ('asc.~ 
(l(,,eralisaCioll techniqu(,~ % ('ol:re('t 
Sitnilaril.y-cla.ss 88. I 
Select root of sul)-hierar(:hy 85.5 
Alwa.ys select (root) 8,5.6 
In order to eva.lua.te the similarity-class tech- 
nique further, we took those test cases for which 
tile root wa, s not being selected when estima.ting 
bet:t, J,(,,I,,~. J,') .+,,d \])('/,.1 I~,,. pv). n:\],is .pplied 
to 113 c~ses. The results ~u;e given in Table 5. 
We a.lso took those test cases for which the root 
was I)eing selected when estimating +~t most one 
of p(v\[c+,,pr) a.nd p(,q \[c,~, pr). This a.pplie, d to 
\]032 test ca.sos. The results a.re shown in %> 
ble 6. 
the extraction heuristic when applied to the \]%nn Tree- 
ba.nk (excluding cases where the ln:eposition is of). 
199 
6 Conclusions 
We have shown that when instances of Word- 
Net are well populated with examples of 
n2, the method described here for solving 
P1)-attachment ambiguities is highly accurate. 
When WordNet is sparsely populated, the 
method automatically resorts to comparing just 
the preposition and each of the potential attach- 
ment sites, as the similarity-class technique will 
select {root} as the appropriate level of general- 
\]sat\]on for n2 in such cases. We have also shown 
the similarity-class technique to be superior to 
using a fixed level of general\]sat\]on in WordNet. 
Further work will look at how to integrate 
probabilities such as p(clv, r) into a model of 
dependency structure, similar to that of Collins 
(1996) and Collins (1997), which can be used 
\['or parse selection. However, knowledge of se- 
\]ectional preferences cannot by itself solve the 
problem of structural disambiguation, and this 
further work will also look at using additional 
knowledge, such a.s subcategorisation informa- 
tion. 

References 

Eric Brill and Philip Resnik. 1994. A rule-based 
approach to prepositional phrase attachment 
disambiguation, in Proceedings of the fifteenth International Conference on Computational Linguistics. 

Eugene (~harnia.k. 1993. ,5'tali.slical Language 
Lcarni'ng. The MIT Press. 

Stephen Clark and David Weir. 1999. An iterative approach to estimating frequencies 
over a semantic hierarchy. In Proceedings of 
the Joint SIGDAT Conference on Empirical 
Methods in Natural Language Processing and 
Very Large Corpora pages 258-265. 

Michael Collins. 1995. Prepositional phrase 
attachment through a backed-off model. In 
Proceedings of the Third Workshop on Very 
Large Corpora , pages 27-38, Cambridge, 
Massachusetts. 

Michael Collins. 1996. A new statistical parser 
based on bigram lexical dependencies. In 
Proceedings of the 3/tth Annual Meeting of the 
ACL, pages 184-1911. 

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 351h Annual Meeting of the Association for Computational Linguistics, 
pages 16-23. 

Ted Dunning. 1993. Accurate methods for the 
statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74. 

Christiane Fellbaum, editor. 1998. WordNct 
An l'2lcctronic Lcxical Database. The MIT 
Press. 

Donald Hindle and Mats Rooth. 1993. Structural ambiguity and lexical relations. Computational Linguistics, 19(1): 103-120. 

l)avid Howell. 11997. Statistical Methods for 
Psychology: ~th cd. Duxbury Press. 

Hang Li and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the 
MDL principle. Computational Linguistics, 
24(2): 17-244. 

Adwait Ratnaparkhi, Jeff Reynar, and Salim 
Roukos. 1994. A maximum entropy model 
for prepositional phrase attachment. In Proceedings of the ARPA Human Language Technology Workshop, pages 250-255. 

Adwait Ratnaparkhi. 1998. Unsupervised statistical models for prepositional phrase attachment. In Proceedings of the Seventeenth International Conference on Computational 
Linguistics, Montreal, Canada, Aug. 

Pllilip Rcsnik. 71993. ,5'clcction a~zd hdbrma- 
lion: A Class-Based Approach to l, czical Re- 
lationships. Ph.l). thesis, University of Penn- 
sylvania. 

Francesc Ribas. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European 
Chapter of the Association for Computational 
Linguistics Dublin, Ireland. 

Jiri Stetina and Makoto Nagao. 1997. Corpus 
based PP attachment ambiguity resolution 
with a semantic dictionary. In Proceedings of 
thc Fifth Workshop on Very Large Corpora, 
pages 66-80, Beijing and ltong Kong. 

David Yarowsky. 11992. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of COLING-92, pages 454-460. 

Jakub Zavrel and Walter Daelemans. 1997. 
Memory-based learning: Using similarity for 
smoothing. In Proceedings of ACL/EACL-97, Madrid, Spain. 
