Explaining away ambiguity: Learning verb selectional preference
with Bayesian networks*
Massimiliano Ciaramita and Mark Johnson
Cognitive and Linguistic Sciences
Box 1978, Brown University
Providence, RI 02912, USA
massimiliano_ciaramita@brown.edu   mj@cs.brown.edu
Abstract
This paper presents a Bayesian model for unsupervised learning of verb selectional
preferences. For each verb the model creates a Bayesian network whose architecture
is determined by the lexical hierarchy of Wordnet and whose parameters are estimated
from a list of verb-object pairs extracted from a corpus. "Explaining away", a
well-known property of Bayesian networks, helps the model deal in a natural fashion
with word sense ambiguity in the training data. On a word sense disambiguation test
our model performed better than other state of the art systems for unsupervised
learning of selectional preferences. Computational complexity problems, ways of
improving this approach and methods for implementing "explaining away" in other
graphical frameworks are discussed.
1 Selectional preference and sense 
ambiguity 
Regularities of a verb with respect to the semantic class of its arguments (subject,
object and indirect object) are called selectional preferences (SP) (Katz and Fodor,
1964; Chomsky, 1965; Johnson-Laird, 1983). The verb pilot carries the information
that its object will likely be some kind of vehicle; subjects of the verb think tend
to be human; and subjects of the verb bark tend to be dogs. For the sake of
simplicity we will focus on the verb-object relation although the techniques we
describe can be applied to other verb-argument pairs.
* We would like to thank the Brown Laboratory for Linguistic Information Processing;
Thomas Hofmann; Elie Bienenstock; Philip Resnik, who provided us with training and
test data; and Daniel Garcia for his help with the SMILE library of classes for
Bayesian networks that we used for our experiments. This research was supported by
NSF awards 9720368, 9870676 and 9812169.
Figure 1: A portion of Wordnet. (The noun hierarchy under ENTITY, with FOOD, LIQUID
and PHYSICAL OBJECT branches; BEVERAGE and COFFEE dominate JAVA-1, while LAND and
ISLAND dominate JAVA-2, so the word java instantiates two different synsets.)
Models of the acquisition of SP are important in their own right and have
applications in Natural Language Processing (NLP). The selectional preferences of a
verb can be used to infer the possible meanings of an unknown argument of a known
verb; e.g., it might be possible to infer that xzzz is a kind of dog from the
following sentence: "The xzzz barked all night". In parsing a sentence, selectional
preferences can be used to rank competing parses, providing a partial measure of
semantic well-formedness. Investigating SP might also help us to understand the
structure of the mental lexicon.
Systems for unsupervised learning of SP usually combine statistical and
knowledge-based approaches. The knowledge-based component is typically a database
that groups words into classes. In the models we will see, the knowledge base is
Wordnet (Miller, 1990). Wordnet groups nouns into classes of synonyms representing
concepts, called synsets, e.g., {car, auto, automobile, ...}. A noun that belongs to
several synsets is ambiguous. A transitive and asymmetrical relation, hyponymy, is
defined between synsets. A synset is a hyponym of another synset if the former has
the latter as a broader concept; for example, BEVERAGE is a hyponym of LIQUID.
Figure 1 depicts a portion of the hierarchy.
The statistical component consists of predicate-argument pairs extracted from a
corpus in which the semantic class of the words is not indicated. A trivial
algorithm might get a list of words that occurred as objects of the verb and output
the semantic classes the words belong to according to Wordnet. For example, if the
verb drink occurred with water and water ∈ LIQUID, the model would learn that drink
selects for LIQUID. As Resnik (1997) and Abney and Light (1999) have found, the main
problem these systems face is the presence of ambiguous words in the training data.
If the word java also occurred as an object of drink, since java ∈ BEVERAGE and
java ∈ ISLAND, this model would learn that drink selects for both BEVERAGE and
ISLAND.
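The following is a minimal sketch of this trivial algorithm; the hand-coded SENSES
and HYPERNYM maps are a toy stand-in for Wordnet, not the real database or its API.

```python
# Toy stand-in for Wordnet: SENSES maps a noun to its synsets, HYPERNYM
# maps a synset to its (single, here) hypernym.
SENSES = {"water": ["WATER"], "java": ["JAVA-1", "JAVA-2"]}
HYPERNYM = {"WATER": "LIQUID", "JAVA-1": "COFFEE", "COFFEE": "BEVERAGE",
            "BEVERAGE": "LIQUID", "JAVA-2": "ISLAND", "ISLAND": "LAND",
            "LIQUID": "ENTITY", "LAND": "ENTITY"}

def ancestor_classes(synset):
    """The synset itself plus every class dominating it."""
    out = [synset]
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        out.append(synset)
    return out

def trivial_preferences(observed_objects):
    """Return every class that any observed object word could instantiate."""
    classes = set()
    for word in observed_objects:
        for sense in SENSES[word]:
            classes.update(ancestor_classes(sense))
    return classes

print(trivial_preferences(["water", "java"]))
# The result contains both BEVERAGE and ISLAND: the ambiguity of java
# is passed on unfiltered to the learned preferences.
```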
More complex models have been proposed. These models, though, deal with word sense
ambiguity by applying an unselective strategy similar to the one above; i.e., they
assume that ambiguous words provide equal evidence for all their senses. These
models choose as the concepts the verb selects for those that are in common among
several words (e.g., BEVERAGE above). This strategy works to the extent that these
overlapping senses are also the concepts the verb selects for.
2 Previous approaches to learning 
selectional preference 
2.1 Resnik's model
Our system is closely related to those proposed in (Resnik, 1997) and (Abney and
Light, 1999). The fact that a predicate p selects for a class c, given a syntactic
relation r, can be represented as a relation, selects(p, r, c); e.g., that eat
selects for FOOD in object position can be represented as
selects(eat, object, FOOD).
In (Resnik, 1997) selectional preference is quantified by comparing the prior
distribution of a given class c appearing as an argument, P(c), and the conditional
probability of the same class given a predicate and a syntactic relation,
P(c | p, r), e.g., P(FOOD) and P(FOOD | eat, object).
Figure 2: Simplified Wordnet. The numbers next to the synsets represent the values
of freq(p, r, c) estimated using (3); the numbers in parentheses represent the
values of freq(p, r, w). (COGNITION 1/4 and FOOD 7/4 at the top; below them
ESSENCE 1/4, FLESH 1/4, FRUIT 1/2, BREAD 1/2 and DAIRY 1/2; at the leaves idea(0),
meat(1), apple(1), bagel(1) and cheese(1).)
The relative entropy between P(c) and P(c | p, r) measures how much the predicate
constrains its arguments:

    S(p, r) = D(P(c | p, r) || P(c))                                        (1)

Resnik defines the selectional association of a predicate for a particular class c
to be the portion of the selectional preference strength due to that class:

    A(p, r, c) = (1 / S(p, r)) P(c | p, r) log (P(c | p, r) / P(c))         (2)
Here the main problem is the estimation of P(c | p, r). Resnik suggests as a
plausible estimator P̂(c | p, r) =def freq(p, r, c) / freq(p, r). But since the model
is trained on data that are not sense-tagged, there is no obvious way to estimate
freq(p, r, c). Resnik suggests considering each observation of a word as evidence
for each of the classes the word belongs to,

    freq(p, r, c) = Σ_{w ∈ c} count(p, r, w) / classes(w)                   (3)

where count(p, r, w) is the number of times the word w occurred as an argument of p
in relation r, and classes(w) is the number of classes w belongs to.
For example, suppose the system is trained on (eat, object) pairs and the verb
occurred once each with meat, apple, bagel, and cheese, and Wordnet is simplified as
in Figure 2. An ambiguous word like meat provides evidence also for classes that
appear unrelated to those selected by the verb. Resnik's assumption is that only the
classes selected by the verb will be associated with each of the observed words, and
hence will receive the highest values for P(c | p, r).
Figure 3: The HMM version of the simple example. (From the start state the
COGNITION branch receives probability 1/8 and the FOOD branch 7/8; the words idea,
meat, apple, bagel and cheese are emitted at the leaves.)
Using (3) we find that the highest frequency is in fact associated with FOOD:
freq(eat, object, FOOD) = 1/4 + 1/2 + 1/2 + 1/2 = 7/4 and P(FOOD | eat) = 0.44.
However, some evidence is found also for COGNITION:
freq(eat, object, COGNITION) = 1/4 and P(COGNITION | eat) = 0.06.
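As a check on the arithmetic, here is a small sketch of estimator (3) on the toy
hierarchy of Figure 2; the per-word sense counts classes(w) are assumptions chosen
to reproduce the figure, and the upward propagation of credit to ancestor classes is
our reading of the scheme rather than Resnik's own code.

```python
from collections import defaultdict

# Toy fragment of Figure 2. LEAF_SENSES lists the leaf classes each word can
# instantiate in the simplified tree; NUM_CLASSES is classes(w), the total
# number of classes the word belongs to (assumed values).
HYPERNYM = {"ESSENCE": "COGNITION", "FLESH": "FOOD", "FRUIT": "FOOD",
            "BREAD": "FOOD", "DAIRY": "FOOD"}
LEAF_SENSES = {"meat": ["ESSENCE", "FLESH"], "apple": ["FRUIT"],
               "bagel": ["BREAD"], "cheese": ["DAIRY"]}
NUM_CLASSES = {"meat": 4, "apple": 2, "bagel": 2, "cheese": 2}

def resnik_freq(counts):
    """freq(p, r, c) = sum over w in c of count(p, r, w) / classes(w),
    credited to every class dominating one of w's senses."""
    freq = defaultdict(float)
    for word, n in counts.items():
        for c in LEAF_SENSES[word]:
            while True:
                freq[c] += n / NUM_CLASSES[word]
                if c not in HYPERNYM:
                    break
                c = HYPERNYM[c]
    return freq

counts = {"meat": 1, "apple": 1, "bagel": 1, "cheese": 1}   # (eat, object) data
freq = resnik_freq(counts)
total = sum(counts.values())                                # freq(p, r) = 4
print(freq["FOOD"], freq["FOOD"] / total)                   # 1.75  0.4375 (~0.44)
print(freq["COGNITION"], freq["COGNITION"] / total)         # 0.25  0.0625 (~0.06)
```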
2.2 Abney and Light's approach 
Abney and Light (1999) pointed out that the distribution of senses of an ambiguous
word is not uniform. They noticed also that it is not clear how the probability
P(c | p, r) is to be interpreted, since there is no explicit stochastic generation
model involved.

They proposed a system that associates a Hidden Markov Model (HMM) with each
predicate-relation pair (p, r). Transitions between synset states represent the
hyponymy relation, and ε, the empty word, is emitted with probability 1; transitions
to a final state emit a word w with probability 0 < P(w) < 1. Transition and
emission probabilities are estimated using the EM algorithm on training data that
consist of the nouns that occurred with the verb. Abney and Light's model estimates
P(c | p, r) from the model trained for (p, r); the distribution P(c) can be
calculated from a model trained for all nouns in the corpus.

This model did not perform as well as expected. An ambiguous word in the model can
be generated by more than one state sequence. Abney and Light discovered that the EM
algorithm finds parameter values that associate some probability mass with all the
transitions in the multiple paths that lead to an ambiguous word. In other words,
when there are several state sequences for the same word, EM does not select one of
them over the others.¹ Figure 3 shows the parameters estimated by EM for the same
example as above. The transition to the COGNITION state has been assigned a
probability of 1/8 because it is part of a possible path to meat. The HMM model
does not solve the problem of the unselective distribution of the frequency of
occurrence of an ambiguous word to all its senses. Abney and Light claimed that this
is a serious problem, particularly when the ambiguous word is a frequent one, and
caused the model to learn the wrong selectional preferences. To correct this
undesirable outcome they introduced some smoothing and balancing techniques.
However, even with these modifications their system's performance was below that
achieved by Resnik's.
3 Bayesian networks 
A Bayesian network (Pearl, 1988), or Bayesian belief network (BBN), consists of a
set of variables and a set of directed edges connecting the variables. The variables
and the edges define a directed acyclic graph (DAG) where each variable is
represented by a node. Each variable is associated with a finite number of (mutually
exclusive) states. To each variable A with parents B1, ..., Bn is attached a
conditional probability table (CPT) P(A | B1, ..., Bn). Given a BBN, Bayesian
inference can be used to estimate marginal and posterior probabilities given the
evidence at hand and the information stored in the CPTs, the prior probabilities, by
means of Bayes' rule, P(H | E) = P(H) P(E | H) / P(E), where H stands for hypothesis
and E for evidence.
Bayesian networks display an extremely interesting property called explaining away.
Word sense ambiguity in the process of learning SP defines a problem that might be
solved by a model that implements an explaining away strategy. Suppose we are
learning the selectional preference of drink, and the network in Figure 4 is the
knowledge base.
¹ As a matter of fact, for this HMM there are (infinitely) many parameter values
that maximize the likelihood of the training data; i.e., the parameters are not
identifiable. The intuitively correct solution is one of them, but so are infinitely
many other, intuitively incorrect ones. Thus it is no surprise that the EM algorithm
cannot find the intuitively correct solution.
Figure 4: A Bayesian network for word ambiguity. (ISLAND and BEVERAGE are parent
nodes; java is a child of both, water of BEVERAGE only.)
The verb occurred with java and water. This situation can be represented as a
Bayesian network. The variables ISLAND and BEVERAGE represent concepts in a semantic
hierarchy. The variables java and water stand for possible instantiations of the
concepts. All the variables are Boolean; i.e., they are associated with two states,
true or false. Suppose the following CPTs define the priors associated with each
node.² The unconditional probabilities are P(I = true) = P(B = true) = 0.01 and
P(I = false) = P(B = false) = 0.99, and the CPTs for the child nodes are
                 I, B    ¬I, B   I, ¬B   ¬I, ¬B
    j = true     0.99    0.99    0.99    0.01
    j = false    0.01    0.01    0.01    0.99
    w = true     0.99    0.99    0.01    0.01
    w = false    0.01    0.01    0.99    0.99
These values mean that the occurrence of either concept is a priori unlikely. If
either concept is true, the word java is likely to occur. Similarly, if BEVERAGE
occurs it is likely that the word water is also observed. As the posterior
probabilities show, if java occurs, the beliefs in both concepts increase:
P(I | j) = P(B | j) = 0.3355. However, water provides evidence for BEVERAGE only.
Overall there is more evidence for the hypothesis that the concept being expressed
is BEVERAGE and not ISLAND. Bayesian networks implement this inference scheme; if we
compute the conditional probabilities given that both words occurred, we obtain
P(B | j, w) = 0.98 and P(I | j, w) = 0.02. The new evidence caused the "island"
hypothesis to be explained away!
3.1 The relevance of priors 
Explaining away seems to depend on the specification of the prior probabilities.
² I, B, j and w abbreviate ISLAND, BEVERAGE, java and water, respectively.
Figure 5: A Bayesian network for the simple example.
The priors define the background knowledge available to the model, not only about
the conditional probabilities of the events represented by the variables but also
about the joint distributions of several events. In the simple network above, we
defined the probability that either concept is selected (i.e., that the
corresponding variable is true) to be extremely small. Intuitively, there are many
concepts and the probability of observing any particular one is small. This means
that the joint probability of the two events is much higher in the case in which
only one of them is true (0.0099) than in the case in which they are both true
(0.0001). Therefore, via the priors, we introduced a bias according to which the
hypothesis that one concept is selected will be favored over two co-occurring ones.
This is a general pattern of Bayesian networks; the prior causes simpler
explanations to be preferred over more complex ones, and thereby the explaining away
effect.
4 A Bayesian network approach to 
learning selectional preference 
4.1 Structure and parameters of the 
model 
The hierarchy of nouns in Wordnet defines a DAG. Its mapping into a BBN is
straightforward. Each word or synset in Wordnet is a node in the network. If A is a
hyponym of B there is an arc in the network from B to A. All the variables are
Boolean. A synset node is true if the verb selects for that class. A word node is
true if the word can appear as an argument of the verb. The priors are defined
following two intuitive principles. First, it is unlikely that a verb a priori
selects for any particular synset. Second, if a verb does select for a synset, say
FOOD, then it is likely that it also selects for
its hyponyms, say FRUIT. The same principles apply to words: it is likely that a
word appears as an argument of the verb if the verb selects for any of its possible
senses. On the other hand, if the verb does not select for a synset, it is unlikely
that the words instantiating the synset occur as its arguments. "Likely" and
"unlikely" are given numerical values that sum up to 1. The following table defines
the scheme for the CPTs associated with each node in the network; Pi(X) denotes the
i-th parent of the node X.
    P(X = x | P1(X) ∨ ... ∨ Pn(X) = true):     x = true   likely      x = false   unlikely
    P(X = x | P1(X) = ... = Pn(X) = false):    x = true   unlikely    x = false   likely
For the root nodes, the table reduces to the unconditional probability of the node.
Now we can test the model on the simple example seen earlier. W+ is the set of words
that occurred with the verb. The nodes corresponding to the words in W+ are set to
true and the others left unset. For the previous example
W+ = {meat, apple, bagel, cheese}, and the corresponding nodes are set to true, as
depicted in Figure 5. With likely and unlikely respectively equal to 0.99 and 0.01,
the posterior probabilities are³ P(F | m, a, b, c) = 0.9899 and
P(C | m, a, b, c) = 0.0101. Explaining away works. The posterior probability of
COGNITION gets as low as its prior, whereas the probability of FOOD goes up to
almost 1. A Bayesian network approach seems to actually implement the conservative
strategy we thought to be the correct one for unsupervised learning of selectional
restrictions.
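The sketch below illustrates this CPT scheme with brute-force enumeration on a toy
ancestor graph shaped like Figure 2. The exact structure of the network in Figure 5
is our assumption here, so the printed numbers will not match the figures above
exactly, but FOOD's posterior comes out close to 1 while COGNITION's stays near its
prior.

```python
from itertools import product

# Toy ancestor graph (an assumption, loosely following Figure 2):
# PARENTS maps every node to its parent synsets.
LIKELY, UNLIKELY = 0.99, 0.01
PARENTS = {
    "COGNITION": [], "FOOD": [],
    "ESSENCE": ["COGNITION"], "FLESH": ["FOOD"], "FRUIT": ["FOOD"],
    "BREAD": ["FOOD"], "DAIRY": ["FOOD"],
    "meat": ["ESSENCE", "FLESH"], "apple": ["FRUIT"],
    "bagel": ["BREAD"], "cheese": ["DAIRY"],
}

def prob(node, value, assignment):
    """CPT scheme of Section 4.1: true is 'likely' iff some parent is true;
    root synsets are a priori 'unlikely' to be selected."""
    parents = PARENTS[node]
    if parents:
        p_true = LIKELY if any(assignment[p] for p in parents) else UNLIKELY
    else:
        p_true = UNLIKELY
    return p_true if value else 1.0 - p_true

def posterior(query, evidence):
    """P(query = true | evidence) by brute-force enumeration."""
    nodes = list(PARENTS)
    num = den = 0.0
    for values in product([True, False], repeat=len(nodes)):
        a = dict(zip(nodes, values))
        if any(a[n] != v for n, v in evidence.items()):
            continue                      # inconsistent with the evidence
        p = 1.0
        for n in nodes:
            p *= prob(n, a[n], a)
        den += p
        if a[query]:
            num += p
    return num / den

W_plus = {"meat": True, "apple": True, "bagel": True, "cheese": True}
print(posterior("FOOD", W_plus))          # close to 1
print(posterior("COGNITION", W_plus))     # close to its 0.01 prior
```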
4.2 Computational issues in building 
BBNs based on Wordnet 
The implementation of a BBN for the whole of Wordnet faces computational complexity
problems typical of graphical models. A densely connected BBN presents two kinds of
problems. The first is the storage of the CPTs. The size of a CPT grows
exponentially with the number of parents of the node.⁴ This problem can be
³ F, C, m, a, b and c respectively stand for FOOD, COGNITION, meat, apple, bagel and
cheese.
⁴ Some words in Wordnet have more than 20 senses. For example, line in Wordnet is
associated with 25 senses. The size of its CPT is therefore 2^26. Storing a table of
float numbers for this node alone requires around (2^26) · 8 = 537 MBytes of memory.
Figure 6: The subnetwork for drink. (Only the ancestors of the words java and drink
are kept: ENTITY at the top with its FOOD, LIQUID and PHYSICAL OBJECT branches;
BEVERAGE and COFFEE above JAVA-1 and drink; LAND and ISLAND above JAVA-2.)
solved by optimizing the representation of these tables. In our case most of the
entries have the same values, and a compact representation for them can be found
(much like the one used in the noisy-OR model (Pearl, 1988)).
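As a sketch of what such a compact representation might look like (not the actual
SMILE implementation): since every entry of our CPTs is either likely or unlikely,
the table for a node with n parents need not store 2^n rows; the entry can be
computed on demand from the parent assignment.

```python
class OrLikeCPT:
    """Compact CPT: P(X = true | parents) = likely if any parent is true,
    else unlikely; only two numbers are stored, whatever the number of parents."""
    def __init__(self, likely=0.9, unlikely=0.1):
        self.likely, self.unlikely = likely, unlikely

    def p_true(self, parent_values):
        return self.likely if any(parent_values) else self.unlikely

# A word with 25 senses has 25 parents; an explicit table would need
# 2 * 2**25 floats (footnote 4), while this representation needs just two.
cpt = OrLikeCPT()
print(cpt.p_true([False] * 24 + [True]))   # 0.9
print(cpt.p_true([False] * 25))            # 0.1
```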
A harder problem is performing inference. The graphical structure of a BBN
represents the dependency relations among the random variables of the network. The
algorithms used with BBNs usually perform inference by dynamic programming on the
triangulated moral graph. A lower bound on the number of computations that are
necessary to model the joint distribution over the variables using such algorithms
is 2^m, where m is the size of the maximal boundary set according to the visitation
schedule.
4.3 Subnetworks and balancing 
Because of these problems we could not build a single BBN for Wordnet. Instead we
simplified the structure of the model by building a smaller subnetwork for each
predicate-argument pair. A subnetwork consists of the union of the sets of ancestors
of the words in W+. Figure 6 provides an example of the union of these "ancestral
subgraphs" of Wordnet for the words java and drink (compare it with Figure 1).
This simplification does not affect the computation of the distributions we are
interested in; that is, the marginals of the synset nodes. A BBN provides a compact
representation for the joint distribution over the set of variables
in the network. If N is a Bayesian network with variables X1, ..., Xn, its joint
distribution P(N) is the product of all the conditional probabilities specified in
the network,

    P(N) = Π_j P(Xj | pa(Xj))                                               (4)

where pa(X) is the set of parents of X. A BBN generates a factorization of the joint
distribution over its variables. Consider a network of three nodes A, B, C with arcs
from A to B and C. Its joint distribution can be characterized as
P(A, B, C) = P(A) P(B | A) P(C | A). If there is no evidence for C the joint
distribution is

    P(A, B, C) = P(A) P(B | A) Σ_C P(C | A)
               = P(A) P(B | A)
               = P(A, B)
The node C gets marginalized out. Marginaliz- 
ing over a childless node is equivalent to remov- 
ing it with its connections from the network. 
Therefore the subnetworks are equivalent to the 
whole network; i.e., they have the same joint 
distribution. 
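The subnetwork construction itself is simple graph work; the sketch below (with a
toy hypernym map standing in for Wordnet) collects the union of the ancestral
subgraphs of the observed words, which is all the marginalization argument above
requires.

```python
HYPERNYMS = {                         # child -> parents, toy fragment of Figure 1
    "java": ["JAVA-1", "JAVA-2"], "drink": ["BEVERAGE"],
    "JAVA-1": ["COFFEE"], "COFFEE": ["BEVERAGE"], "BEVERAGE": ["LIQUID"],
    "LIQUID": ["ENTITY"], "JAVA-2": ["ISLAND"], "ISLAND": ["LAND"],
    "LAND": ["PHYSICAL OBJECT"], "PHYSICAL OBJECT": ["ENTITY"],
}

def ancestors(node):
    """The node itself plus everything reachable along hypernym links."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(HYPERNYMS.get(n, []))
    return seen

def subnetwork(w_plus):
    """Union of the ancestral subgraphs of the words in W+; every node left
    out is childless given the evidence, hence marginalized out for free."""
    keep = set()
    for word in w_plus:
        keep |= ancestors(word)
    return {n: [p for p in HYPERNYMS.get(n, []) if p in keep] for n in keep}

print(sorted(subnetwork({"java", "drink"})))   # the nodes of Figure 6
```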
Our model computes the value of P(c | p, r), but we did not compute the prior P(c)
for all nouns in the corpus. We assumed this to be a constant, equal to the unlikely
value, for all classes. In a BBN the values of the marginals increase with their
distance from the root nodes. To avoid undesired bias (see table of results) we
defined a balancing formula that adjusted the conditional probabilities of the CPTs
in such a way that all the marginals had approximately the same value.
5 Experiments and results
5.1 Learning of selectional preferences
When trained on predicate-argument pairs extracted from a large corpus, the San Jose
Mercury Corpus, the model gave very good results. The corpus contains about 1.3
million verb-object tokens. The obtained rankings of classes according to their
posterior marginal probabilities were good. Table 1 shows the top and the bottom of
the list of synsets for the verb maneuver.
⁵ More details can be found in an extended version of the paper:
www.cog.brown.edu/~massi/.
⁶ For these experiments we used values for the likely and unlikely parameters of 0.9
and 0.1, respectively.
    Ranking   Synset          P(c | p, r)
    1         VEHICLE         0.9995
    2         VESSEL          0.9893
    3         AIRCRAFT        0.9937
    4         AIRPLANE        0.9500
    5         SHIP            0.9114
    ...
    255       CONCEPT         0.1002
    256       LAW             0.1001
    257       PHILOSOPHY      0.1000
    258       JURISPRUDENCE   0.1000

Table 1: Results for (maneuver, object).
The model learned that maneuver "selects" for members of the class VEHICLE and of
other plausible classes, hyponyms of VEHICLE. It also learned that the verb does not
select for direct objects that are members of classes like CONCEPT or PHILOSOPHY.
5.2 Word sense disambiguation test 
A direct evaluation measure for unsupervised learning of SP models does not exist.
These models are instead evaluated on a word sense disambiguation (WSD) test. The
idea is that systems that learn SP produce word sense disambiguation as a
side-effect. Java might be interpreted as the island or the beverage, but in a
context like "the tourists flew to Java" the former seems more correct, because fly
could select for geographic locations but not for beverages. A system trained on a
predicate p should be able to disambiguate arguments of p if it has learned its
selectional restrictions.
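In other words, the learned scores can be turned into a sense tagger. The sketch
below is a hypothetical illustration of this idea, not the scoring rule actually
used in the evaluation: each sense of the ambiguous argument is rated by the best
score the model assigns to any of the classes dominating it.

```python
def disambiguate(word, senses, ancestry, score):
    """Pick the sense whose ancestor classes the verb's model scores highest."""
    return max(senses[word],
               key=lambda s: max(score.get(c, 0.0) for c in ancestry[s]))

# Hypothetical toy run for (fly, object, "java"):
senses = {"java": ["JAVA-1", "JAVA-2"]}                  # beverage vs. island sense
ancestry = {"JAVA-1": ["JAVA-1", "COFFEE", "BEVERAGE"],
            "JAVA-2": ["JAVA-2", "ISLAND", "LOCATION"]}
score = {"LOCATION": 0.95, "BEVERAGE": 0.10}             # learned for (fly, object)
print(disambiguate("java", senses, ancestry, score))     # -> JAVA-2, the island
```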
We tested our model using the test and training data developed by Resnik (see
Resnik, 1997). The same test was used in (Abney and Light, 1999). The training data
consist of predicate-object counts extracted from 4/5 of the Brown corpus (about 1M
words). The test set consists of predicate-object pairs from the remaining 1/5 of
the corpus, which has been manually sense-annotated by Wordnet researchers. The
results are shown in Table 2. The baseline algorithm chooses at random one of the
multiple senses of an ambiguous word. The "first sense" method always chooses the
most frequent sense (such a system should be trained on sense-tagged data).
    Method                              Result
    Baseline                            28.5%
    Abney and Light (HMM smoothed)      35.6%
    Abney and Light (HMM balanced)      42.3%
    Resnik                              44.3%
    BBN (without balancing)             45.6%
    BBN (with balancing)                51.4%
    First Sense                         82.5%

Table 2: Results.
Our model performed better than the state of the art models for unsupervised
learning of SP. It seems to define a better estimator for P(c | p, r).
It is remarkable that the model achieved this result making only a limited use of
distributional information. A noun is in W+ if it occurred at least once in the
training set, but the system does not know whether it occurred once or several
times; either it occurred or it didn't. The model did not suffer too much from this
limitation during this task. This is probably due to the sparseness of the training
data for the test. For each verb the average number of object types is 3.3, and for
each of them the average number of tokens is 1.3; i.e., most of the words in the
training data only occurred once. For this training set we also tested a version of
the model that built a word node for each observed object token and therefore
integrated the distributional information. On the WSD test it performed exactly the
same as the simpler version. When trained on the San Jose Mercury Corpus the model
performed worse on the WSD test (35.8%). This is not too surprising considering the
differences between the SJM and the Brown corpora: the former, a recent newswire
corpus; the latter, an older, balanced corpus. Another important factor is the
different relevance of distributional information. The training data from the SJM
Corpus are much richer and noisier than the Brown data. Here the frequency
information is probably crucial; however, in this case we could not implement the
simple scheme above.
5.3 Conclusion 
Explaining away implements a cognitively attractive and successful strategy. A
straightforward improvement would be for the model to make full use of the
distributional information present in the training data; we only partially achieved
this. Bayesian networks are usually confronted with a single presentation of
evidence. Their extension to multiple evidence is not trivial. We believe the model
can be extended in this direction. Possibly there are several ways to do so
(multinomial sampling, dedicated implementations, etc.). However, we believe that
the most relevant finding of this research might be that "explaining away" is not
only a property of Bayesian networks but of Bayesian inference in general, and that
it might be implementable in other kinds of graphical models. We observed that the
property seems to depend on the specification of the prior probabilities. We found
that the HMM model of (Abney and Light, 1999) was unidentifiable; that is, there are
several solutions for the parameters of the model, including the desired one. Our
intuition is that it should be possible to implement "explaining away" in an HMM
with priors, so that it would prefer only one or a few solutions over many. This
model would also have the advantage of being computationally simpler.

References 

S. Abney and M. Light. 1999. Hiding a semantic hierarchy in a Markov model. In Proceedings of the Workshop on Unsupervised Learning in Natural Language Processing, ACL. 

N. Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.

P. N. Johnson-Laird. 1983. Mental Models: Towards a Cognitive Science of Language,
Inference, and Consciousness. Harvard University Press.

J. J. Katz and J. A. Fodor. 1964. The structure of a semantic theory. In J. J. Katz
and J. A. Fodor, editors, The Structure of Language. Prentice-Hall.

G. Miller. 1990. Wordnet: An on-line lexical database. International Journal of
Lexicography, 3(4).

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann.

P. Resnik. 1997. Selectional preference and sense disambiguation. In Proceedings of
the ANLP-97 Workshop: Tagging Text with Lexical Semantics: Why, What, and How?
