Word Sense Disambiguation of Adjectives 
Using Probabilistic Networks 
Gerald Chao and Michael G. Dyer 
Coinl)uter Science Det)artment, 
University of California, Los Angeles 
Los Angeles, California 90095 
gerald@cs.ucla.edu, dyer~cs.ucla.edu 
Abstract 
In this paper, word sense dismnbiguation (WSD) ac- 
curacy achievable by a probabilistic classifier, using 
very milfimal training sets, is investigated. \Ve made 
the assuml)tiou that there are no tagged corpora 
available and identified what information, needed by 
an accurate WSD system, can and cmmot be auto- 
matically obtained. The lesson learned can then be 
used to locus on what knowledge needs malmal an- 
notation. Our system, named Bayesian Hierarchical 
Disambiguator (BHD), uses the Internet, arguably 
tile largest corlms in existence, to address the st)arse 
data problem, and uses WordNet's hierarchy tbr se- 
mantic contextual features. In addition, Bayesian 
networks are automatically constructed to represent 
knowledge learned from training sets by lnodeling 
the selectional i)retbrence of adjectives. These net- 
works are then applied to disaml)iguation by per- 
tbrming inferences on unseen adjective-noun pairs. 
We demonstrate that this system is able to disam- 
biguate adjectives in um'estricl;ed text at good initial 
accuracy rates without tile need tbr tagged corpora. 
The learning and extensibility aspects of the model 
are also discussed, showing how tagged corpora and 
additional context can be incorporated easily to im- 
prove accm'acy, and how this technique can be used 
to disambiguate other types of word pairs, such as 
verb-noun and adverb-verb pairs. 
1 Introduction 
Word sense disambiguation (WSD) remains an open 
probleln in Natural Language Processing (NLP). Be- 
ing able to identify tile correct sense of an mnbigu- 
ous word is ilnportant for many NLP tasks, such as 
machine translation, information retrieval, and dis- 
course analysis. The WSD problem is exacerbated 
by the large number of senses of colmnonly used 
words and by the difficulty in determining relevant 
contextual features most suitable to the task. Tile 
absence of semantically tagged corpora makes prob- 
abilistic techniques, shown to be very effective by 
speech recognition and syntactic tagging research, 
difficult to employ due to the sparse data problem. 
Early NLP systeins limited their domain and re- 
quired manual knowledge engineering. More recent 
works take advantage of machine readable dictio- 
naries such as WordNet (Miller, 1990) and Roget's 
Online Thesaurus. Statistical techniques, both su- 
pervised learning from tagged corpora (Yarowsky, 
1992), (Ng and Lee, 1.996), and unsupervised learn- 
ing (Yarowsky, 1995), (Resnik, 1997), have been in- 
vestigated. There are also hybrid inodcls that in- 
corporate both statistical and symbolic knowledge 
(Wiebe et al., 1998), (Agirre and I{igau, 1996). 
Supervised models have shown promising results, 
but the lack of sense tagged corpora often requires 
the need tbr laboriously tagging trailfing sets man- 
ually. Depending on the technique, unsupervised 
models can result in ill-defined senses. Many have 
not been evaluated with large vocabularies or flfll 
sets of senses. Hybrid models, using various heuris- 
tics, have demonstrated good accuracy but are ditfi- 
cult to compare clue to variations in the evahlation 
procedures, as discussed in Resnik and Yarowsky 
(\]997). 
In our Bayesian Hierarchical Disambiguator 
(BHD) model, we attempt to address some of the 
main issues faced by today's WSD systelns, namely: 
1) the sparse data problem; 2) the selection of a 
fi;ature set that can be trained upon easily without 
sacrificing accuracy; and 3) the scalability of the sys- 
tein to disambiguate um'estricted text. The first two 
problems can be attributed to the lack of tagged 
corpora, while the third results from the need for 
lland-annotated text as a method of circumventing 
the first two problems. We will address the first two 
issues by identiflying contexts in which knowledge 
can be obtained automatically, as opposed to those 
that require minimal manual tagging. The effective- 
hess of the BHD model is then tested on unrestricted 
text, thus addressing the third issue. 
2 Problem Formulation 
WSD can t)e described as a classification task, 
where the i th sense (W#i) of a word (W) is clas- 
sified as the correct tag, given a word and usually 
some surrounding context. For example, to disam- 
biguate the adjective "great" in the sentence "Tile 
152 
great hm'riealm devastated the region", a \VSI) sys- 
tem should disambiguate "great" as left.q(: in .s'izc 
rather than the good or c:cccllent meaning. Us- 
ing l)robability notations, this t)rocedure can be 
st~ted as maxi(Pr(great#i \[ "great", "ihe", "lnn'- 
ricane", "devastated", "the", "region")). That is, 
given the word "great" and its context, elassit~y the 
sense great#/ with the highest probability as the 
correct o11e. However, a large COllteX\[;~ sll(:h as the 
whole sentence, is rarely used, due to the dilliculty 
in estimating the 1)robability of this particular set of 
words oceurring. Therefore, the context is usually 
narrowed, such as ,z number of surromMing words. 
Additionally, surrollnding syntactic tbatures aim se- 
mantic knowledge are sometimes used. The diffi- 
culty is in choosing the right context, or the set 
of features, that will ol)timize the classification. A 
larger (:ontext intl)roves t;11(! classiti(:ation a(:(:ura(:y al 
the exl)ense of increasing the mlml)er of l)arameters 
(typically learned fr(nn large training data). 
in our BHI) model, a minimal context (:Oml)OSe(l 
of only the adje(:tive, noun and the llOllll;S Selllall- 
tie features obtained fl'()m \VordNel is used. Us- 
ing the al)ove examl)le , only "great", "hurri(:ane" 
and hun'icane's t)atures encoded in WordNet's hi- 
erarchy, as in hurricane lISA cyclone lISA wind- 
storm ISA violent storm..., are used its (;Olltext. 
Therefore, the classitieation 1)ertbrnmd 1)y Bill) (:an 
t)e written as nlaxi(lh(great¢~i \[ "great", "\]nlrri- 
cane", cyt:loltc, windslorm ...)), or more generi- 
cally, ,m(,,:~( l'r(,dj#il(,dj , ,,>,,.,,, < N \],',; >)), ,vher(, 
<N/Zs> denotes the n(mn features. By using the 
Bayesian inversion fornmla, this equati()n l)e(:onw.s 
.,,,.<,:,.~(Pr(,.4i,'l,.o,.,., < N\],'.s > I.,!i#i) x P,(,,4i#i)). 
Pr(,d:i,'no'u,., < Nt;'.s" >) 
(1) 
This eonte×t is chosen because it does not; need an 
annotated training set, and these semantic fe.atures 
are used to l)uiht a belief about the nouns ,:ill adjec- 
tive SellSe typically lnodifies, i.e., the se\]ectional pref- 
erences of a(ljectiv('.s. For examlfle , having learned 
about lmrrieane, the system can infer the most prob- 
able dismnbiguation of "great typhoon", "great tor- 
nado", or ltlore distal concepts such as earthquakes 
and tloods. 
3 Establishing the Parameters 
As shown in equation 1, BH1) requires 
two parameters: 1) the likelihood term 
Pr(adj, noun,< NFs > ladj#i) and 2) the prior 
term Pr(adj#i). The prior term represents the 
knowledge of how frequently a sense of an adjective 
is used without any contextual infornmtion. Dn" 
example, if great#2 (sense: 9ood, excellent) is 
nsed frequently while great#l is less commonly 
used, then Pr(great#2) wtmld be larger than 
Pr(great#l), in l)rOl)ortion to the usage of the 
two senses. Although WordNet orders the senses 
of a t)olysenl(ms word according to usage, the 
actual t)rot)ortions are not quantified. Therefore, to 
(:omtmte the priors, one can iterate over all English 
nouns and sunl the instmlces of great#l-noun 
versus great#-2-noun t)airs. But siilce we assume 
that no training set exists (the worst possible ease 
of the sparse data l)rol)lem), these counts need to 
1)e estinmted from ilMireet sources. 
3.1 The Sparse Data Problem 
The techuique used to address data sI)arsity, as first 
proposed by Mihalcea and Moldovan (1998), treal;s 
the Internet as a cortms to automatically (lisam- 
biguate word 1)airs. Using the previous examl)le , to 
disambiguate the adjective in "great hurricane", two 
synonym lists of ("great, large, t)ig") and ("great, 
neat, good") are retrieved from V¢ordNet. (SOllle 
synonyms ~111(1 el;her SellSeS i/re olllitted here for 
1)revity.) Two queries, ("great hurricane" or "large 
hurrit:ane?' or "t)ig hurricane") and ("great hurri- 
cane" or "neat hurricane" or "good hurriemm"), are 
issued to Altavista, which reporl, s that 1100 and 914: 
1)ages contain these terlllS, respectively. The query 
with the higher count (#1) is classitied its the correct 
sense. For fllrther details, please refer to Mihaleca 
anti Moldovau (1998). 
In our luo(lel, the (:omd;s front Altavista are im:or- 
l)orated as \[)aralltel;er estilllatiollS within our proba- 
bilislh: frlt111(~work. Jill addition to disaml)igualing 
lhe adjectives, we also need to estimale lhe us- 
age of the a(ljec|;ive#i-noml pair. For simt)licil;y , 
the counts fronl Altavista are assigued wholesale 1o 
the disambiguated adjective sense, e.g., the usage 
of great#l-hurricane is 1100 times and great#2- 
hurricane is zero times. This is a great simplifi- 
cation since in many adjective-noun pairs nmltil)le 
meanings are likely. For instance, in "great sl;eak", 
both sense of '"great" (large steak vs. tasty steak) 
are e, qually likely. However, given no other infor- 
mation, this sinlt)lification is used as a gross ap- 
t)roximation of Counts(adj#i-noun), which becoines 
Pr(adj#i-noun) by dividing the counts by a nornml- 
izing constant, ~ Com~ts(adj#i-all nouns). These 
prolmbilities are then used to compute the priors, 
described in the next section. 
Using this technique, two major probleins are ad- 
dressed. Not only are the adjectives automatically 
disambiguated, but the number of occurrenees of the 
word pairs is also estimated. The need lot haud- 
annotated semantic corpora is thus avoided. How- 
ever, the statistics gathered by this technique are 
at)proxinmtions, so the noise they introduce does re- 
quire supervised training to nfinimize error, as will 
lm described. 
153 
3.2 Computing the Priors 
Using the methods described above, the priors cats 
be automatically computed by iterating over all 
nouns and smnnfing tile counts tbr each adjective 
sense. Untbrtunately, the automatic disambiguation 
of tim adjective is not reliable enough and results it, 
inaccurate priors. Therefore, manual classification 
of assigning nouns into one of the adjective senses 
is needed, constituting the first of two manual tasks 
needed by this model. However, instead of classi- 
fying all English nouns, Altavista is again used to 
provide collocation data on 5,000 nouns for each ad- 
jective. The collocation frequency is then sorted and 
the top 100 nouns are manually classified. For ex- 
ample, the top 10 nouns that collocate after "great" 
are "deal", "site", "job", "place", "time", "way", 
"American", "page", "book", and "work". They are 
then all classified as being modified by the great#2 
sense except for tile last one, which is classified into 
another sense, as defined by WordNet. Tim prior 
for each sense is then coinputed by smmning the 
counts Don, t)airing the adjective with the nouns 
classified into that sense and dividing by the sum 
of all adjective-noun pairs. The top 100 collocated 
nouns fbr each adjective are used as an al)proxima- 
tion for all adjective-noun pairs since considering all 
nouns would be impractical. 
To validate these priors, a Naive Bayes classifier 
that comt)utes 
v,.azi Pr(adj, ,!,o'l,r@,dj#i) x Pr(,,dj #i) 
Pr ( a4i, ~o,. ,. ) 
is used, with the noun as the only context. This 
simpler likelihood term is approxinmted by the same 
Internet counts used to establish the priors, i.e., 
Counts(adj#i-noun) / normalizing constant. In Ta- 
ble 1, the accuracy of disambiguating 135 adjective- 
noun pairs Dora the br-a01 file of the semantically 
tagged corlms SemCor (Miller et al., 1993) is com- 
pared to the baseline, wlfich was calculated by using 
the first WordNet sense of the adjective. As men- 
tioned earlier, disambiguating using, simply the high- 
est count Dot** Altavista ("Before Prior" it, Table 
1) achieved a low accuracy of 56%, whereas using 
the sense with the highest prior ("Prior Only") is 
slightly better than the baseline. This result vali- 
dates the fact that the priors established here pre- 
serve WordNet's ordering of sense usage, with tile 
improvement that tim relative usages between senses 
are now quantified. 
Combining both the prior and the likelihood terms 
did not significantly improve or degrade tile accu- 
racy. This would indicate that either the likelihood 
term is uniformly distributed across tile i senses, 
which is contradicted by the accuracy without the 
priors (second row) being significantly higher than 
the average number of senses per adjective of 3.98, 
Accur&cy 
Before Prior 56.3% 
Prior Only 77.0% 
Combined 77.8% 
Baseline 75.6% 
Table 1: Accuracy rates from using a Naive Bayes 
classifier to wflidate the priors. These results show 
that the priors established in this model are as ac- 
curate as the WordNet's ordering according to sense 
usage (Baseline). 
or, more likely that this parameter is subsumed by 
the priors due to the limited context. Therefore, 
more contextual infbrmation is needed to improve 
tim model's peribrmance. 
3.3 Contextual Features 
Instead of adding other types of context such as 
the surrounding words and syntactic features, tlm 
semantic features of tim noun (as encoded in the 
WordNet ISA hierarchy) is investigated for its effec- 
tiveness. These features are readily available and are 
organized into a well-defined structus'e. Tim hierar- 
chy provides a systematic and intuitive method of 
distance measurements between feature vectors, i.e., 
the semantic distance between concepts. This prop- 
erty is very important for inthrring the classification 
of the novel pair "great flood" into the sense tha.t 
contains hurricane as a member of its prototypical 
nouns. These prototypical nouns describe tile se- 
lectional preferences of adjective senses of "great", 
and the semantic distance between them and a new 
noun ineasures 1;11o "semantic fit" between the con- 
cepts. Tile closer the.y are, as with "hurricane" and 
"flood", the higher the prot)ability of the likelihood 
term, whereas distal concepts such as "hurricane" 
and "taste" would have a lower value. 
Representing these prototyt)ical 11011218 prot)abilis- 
tically, however, is difficult due to the exponential 
number of 1)robal)ilit, ies with respect to the nmnber 
of features. For exmnl)le, ret)resenting hurricmm 1)e- 
ing I)resent in a selectional t)reference list requires 
2 s probabilities since there are 8 features, or ISA 
l)arents, in the WordNet hierarchy. It, addition , tile 
sparse data t)roblem resurfaces because each one of 
the 2 s probabilities has to l)e quantified. To ad- 
dress these two issues, belief networks are used, as 
described it, detail in the next section. 
4 Probabilistic Network's 
There are many advantages to using Bayesian net- 
works over the traditional probabilistic models. The 
most notable is that the mlmber of probabilities 
needed to represent the distribution can be signif 
icantly reduced by making independence assump- 
tions between variables, with each node condition- 
154 
~A~ p(AIB, c) 
P(h,B,C3)3¢,l~)=l'(All~,C)xl'(BIb,i;)xP(Cl})) 
xP(\])llV,)xl)(l~)xl)(F) 
Figure :1: A n eXamlih; of a Bayesiml nel;work and the 
t)rolialiilities at ('a(:h node, thai (h'lin(; lhe rehtl.ion- 
shills 1)('tweeu a no(h; and its parents. '\]'he equation 
al; the l)ott()m shows how t.h(, (listrilmtion across all 
(if the variat)les is (:Oml)Ut(',(l. 
ally d(,l/endenl; ltl)on only il;s t)arenls (\]'earl, 1988). 
\]Pigllr(! \] shows all (LxCallll)\](; \]{ayesiall ll(H:Woll( rel)- 
r(;senl;ing l;h(' disl;ribul;i()n \])(;\,B;(',,1);F,,\]"). \]nsl;('a(l 
of having oue large (,able wilh 2" lirol)al)ilit.i(~s (with 
all \]}o()\](;an nod(!s), the (lisl;ribution is rel)resenl(;(t 
by the (:onditional I)rolial)ility tal)les (C,\])Ts) at ea('h 
nod(,, su(:h as I>(B I l), F), requiring a (olal of only 2d 
prol)al)ilitie~s. Not only (lo th(, savings he(:ome more 
significant with larger networks, lmi; (.he sparse data 
t/r()l)h;m b(!(:omes m()r(; manag(;abh, as well. Th(! 
I;l';/illitlg~ s(!l; 11o lollg:er lice(is Ix) (:()v(!r all l)(!l'lllllta- 
l;ions (if l;he f(!a(;m'e sets, lml only smaller sul)sets 
dictated l)y l.he sets of Valial)h!s (if lira CPTs. 
The n(;l.w(/rk shown in Figure 1 lo()ks simi- 
hn to any pOll;ion of th(~ \Vor(1N(~l; hierar(:hy for 
a reason. In BII\]), belief networks with the 
same stru(:tur(; as the \Vor(tNet hierar(:hy are au- 
tomatically (:onstru(:t;ed to rel)l"esent the seh'cl;ional 
I)reference of an a(1.iective sense. S1)ecifically, 
the network rei)resents the prol)aliilistic (tistribu- 
l;ion over all of the i)rotol;yl)i(:al nouns of an a(1- 
jective~i mid the nouns' semanti(: t'ealures, i.e., 
P(v,.ot,,,,,,,,..,., < v,.o~,oNF, > I,-!i#i). 'rh(; ,,s(, of 
Bayesian networks for WSI) has I)een l)rop()sed by 
others such as Wiebe eL al (1.998), but a different 
fl)rmulation is used in this mod('l. Th(, construction 
of the networks in BHD can be divided into three 
steps: defining 1) the training sets, 2) the structure, 
and 3) the probabilities, as described in the tbllowing 
sections. 
4.1 ~lYaining Sets 
The training set for each of the adje(:tive s(',nses is 
(;onstructed by ('.xi;ra(:l;ing l;he exenll)lary adje(:tive- 
noun pairs from tile WordNet glossary. The glossary 
contains the examlih; usage of the a(lje(:tiv('~s, and 
l;\]l(', nouns from thein are taken as the training sets 
" />, t'"il 
, -\ /'\ '""@t :':, 
(?.*ib,,}~ " ,'..,,,.,:p, ........ H 
/ , (,,,,:,,,,@,,. 
(,.,,~,) (,,& ,," ............... ~ ............... 
\ / 
a~",'"') (.,,,5 (.,,,4-. ..... ~ ............... 
Figure 2: The stru(:ture (if the belief n(,l;w(/rk that 
i'el)resenls the, s(~h~ctional l)refer(!n(:(~ (if flrcal,#\[. 
The leaf nodes are 1.he nouns within the training set, 
mid lhe int(~rm('xlial.(; no(les r('th;(:l; the ISA hierar(:hy 
fr()m \\;()r(1Nel;. The 1)rol)ahilities al each node at(', 
used i;o (lisanfl)iguat(! novel a(lj(,(:l;ive-noun 1)airs. 
for th(, a(1.i('(:iives. For exmnl)h! , the nouns "auk", 
"oak", "steak", "delay" and "amount" (:ompose the 
training set fl)r great#l (SellSC" lalT/e iv, size,). Note 
l\]lal; WordNet in(:huh'd "sl;e~al~" ill l;he glossary of 
greal;#\], l)ul: il; al)l)ears thai 1;\]le 9ood or e:ccelhe'H,l, 
5(!lise would lm lliOl'(! aplWol)rial;e. Neverlheh~ss, the 
}isis of exelnl)lary mmns are sysl:(;malit:aily rel;rh'vcd 
and nol ('dile(l. 
Th(" sets (if l/r()tolypi(:al n(/mlS f(/r each a(lj(~(:liv(~ 
sense have to lie (lisaml)igual;ed lie(:ause, the S(~lllall- 
li(: features (lifter 1)etween ambiguous nouns. Since 
these n(mns cmmol; lie autonmti(:ally disamlAguated 
with high accuracy, lhey have to be done mamm\]ly. 
This is the second t)art of (,lie mmmal process need(;d 
1) 3" BIID sitl(:(! the W(/rdNel; gh)ssary is not selnallti- 
(:ally tagged. 
4.2 Belief Network Struetm'e 
The belief networks have the same structure as the 
\VordNet 1SA hierarchy with the ext;et)tion that the 
edges are direcl;ed from the child nodes to their par- 
ents. Ilhlstrated in Figure 2, the BItD-eonstructed 
network represents the selectional t)referelme of the 
to I) level nod(;, great#l. The leaf nodes are tile evi- 
(len(:e nodes from the training set mM the internm(li- 
ale ilo(les are |;t1(; sema.ntie featm'es (if the leaf nodes. 
This organization emd)les the belief gathered from 
the leaf nodes 1;o lie tirot)agated u I) to the tol) level 
node during inferencing, as described in a later sec- 
LioIl. 1}11|; first, th(' l)robability table ae(:oml)anyiug 
ea(:h node needs to be constructed. 
155 
4.3 Quantifying the Network 
The two parameters the belief networks require are 
the CPTs tbr each intermediate node and the pri- 
ors of the leaf nodes, such as P(great#l, hurri- 
cane). The latter is estilnated by tile counts ob- 
tained fronl Altavista, as described earlier, and a 
shortcut is used to specit~y the CPTs. Normally 
the CPTs ill a flflly specified Bayesian network con- 
tain all instantiations of the child and parent values 
and their corresponding probabilities. For example, 
the CPT at node D in Figure 1 would have four 
rows: er(D=tlE:t), Pr(D=tlE=f), Pr(D=flE=t), 
and Pr(D=flE=f ). This is needed to perform flfll 
inferencing, where queries can be issued for any in- 
stantiation of the variables. However, since the net- 
works in this model are used only for one specific 
query, where all nodes are instantiated to be true, 
only the row with all w~riables equal to true, e.g., 
Pr(D=tlE=t), has to be specified. The nature of 
this query will be described in more detail in tile 
next section. 
To calculate the probability that an intermediate 
node and all of its parents are true, one divides the 
number of parents present by the number of possi- 
ble parents as specified ill WordNet. hi Figure 2, 
the small clotted nodes denote the absent parents, 
which deternfine how the probabilities are specified 
at each node. Recall that tile parents in the belief 
network are actually the children in the. WordNet 
hierarchy, so this probability can be seen as the per- 
centage of children actually present, hltuitively, this 
probability is a form of assigning weights to parts of 
the network where more related nouns are present 
in the training set, silnilar to the concept of seman- 
tic density. Tile probability, in conjunction with the 
structure of the belief network, also implicitly en- 
codes the semantic distance between concepts with- 
out necessarily 1)enalizing concepts with deep hier- 
archies. A discount is taken at each ancestral node 
during inferencing (next section) only wlmn some 
of its WordNet children are absent in the network. 
Therefore, the semantic distance can be seen as the 
number of traversals 11I) the network weighted by the 
number of siblings present in tile tree (and not by 
direct edge counting). 
4.4 Querying the Network 
With the probability between nodes specified, the 
network becomes a representation of the selectional 
prefbrence of an adjective sense, with features from 
the WordNet ISA hierarchy providing additional 
knowledge on both semantic densities and semantic 
distances. To disambiguate a novel adjective-noun 
pair such as "great flood", the great#l and great#2 
networks (along with 7 other great#/networks not 
shown here) infer the likelihood that "flood" be- 
longs to the network by comtmting the probability 
Pr(great, flood, <flood NFs>, proto nouns, <l)roto 
NFs> I adj #i), even though neither network has ever 
encountered the noun "flood" before. 
To perform these inferences, the noun and its fea- 
tures are tenlporarily inserted into the network ac- 
cording to the WordNet hierarchy (if not already 
present). The prior for this "hypothetical evidence" 
is obtained the same way as the training set, i.e., by 
querying Altavista, and the CPTs are updated to 
reflect this new addition. To calculate the probabil- 
ity at the top node, any Bayesian network inferenc- 
ing algorithm can be used. However, a query where 
all nodes are instantiated to true is a special case 
since the probability can be comlmted by multiply- 
ing together all priors and the CPT entries where all 
variables are true. 
Ill Figure 3, tile network for great#l is shown with 
"flood" as tile hypothetical evidence added on the 
right. The CPT of the node "natural phenomenon" 
is updated to reflect the newly added evidence. The 
propagation of the probabilities from the leaf nodes 
up the network is shown and illustrates how dis- 
counts art taken at each intermediate node. When- 
ever more related concepts are 1)resent in the net- 
work, such as "typhoon" and "tornado '~, less dis- 
counts are taken and thus a higher probability will 
result at the root node. Converso.ly, one can see that 
with a distal concept, such as "taste" (which is ill a 
completely different branch), the knowledge about 
"hurricane" will have little or no influence on dis- 
ambiguating "great taste". 
The calculation above can be computed in linear 
time with respect to the det/th of tlle query noun 
node (depth=5 in the case of flood#l) and not the 
the nmnber of nodes ill the network. This is impor- 
tant for scaling the network to represent the large 
nuinber of nouns needed to accurately model tile se- 
lectional preferences of adjective senses. Tile only 
cost incurred is storage for a summary probability 
of the children at each internlediate node and time 
for ut)dating these values when a new piece of evi- 
dence is added, which is also linear with respect to 
the depth of the node. 
Finally, the probabilities comt)uted by the infer- 
ence algorithnl are combilmd with the priors estab- 
lished in the earlier section. Tile combined proba- 
bilities represent P(adj#i \] adj, noun, <NFs>), and 
tilt one with the highest probability is classified by 
BHD as tile most plausible sense of the adjective. 
4.5 Evahmtion 
To test the accuracy of BHD, the same procedure 
described earlier was used. Tile same 135 adjective- 
noun pairs from SelnCor were disambiguated by 
BHD and compared to the baseline. Table 2 shows 
the accuracy results froln evaluating either the first 
sense of the nouns or all senses of the nouns. The re- 
sults of the accuracy without the priors Pr(adj#i) in- 
156 
1/16x 1/30x 
'I' 
¢i¢al) 2/-$ 
,S/cLmst x 1/~ 
Illka,d} 1/8 
t tlood) - 1770~ 
x 1/7 x I/I 
Figure 3: (~ll{2ry of {;he flrcrtt#-I 1}elief netw{)rl( 
t(} infe.r l;h{', probal}ility of tlood being m{}{lified l}y 
.q'reat#l.. The left branch of the network has 1}{'.en 
omitted for clarity. 
dicate~ the imI}rovements l}rovided by the likeliho{}{t 
term alone. The itnl}r(}vement gain(~{l from the ad- 
dii.ional conl;extual featm:es shows th(~ ell'{~{:liv{m{~ss 
of the belief networks, l!'~v(m with only 3 t}rol{)tyt)- 
ica.l mmns l}er a(ljecl,ive se.nse (}n av{u'ag{~ (hardly a 
COml)lete deseril)tion of the sel(B(:tional pr(ff(w(u\]{:{;s); 
the gain is very encouraging. Wit;h the 1)ri(}rs t'a{:- 
tored in, 13IID iml}r(}ved even flu:ttmr (81%), signifi- 
('antly surllassing the baseline (75.6%), a feat a{:{:om- 
plished 1)3; {rely (me (}ther m{}del that we are aware (}f 
(airi Sl;el;iIla and Nagao, 1998). Not;e thai; l;h{! l)esl 
~/.C{;lll;}/{;y \v~ls a{'hieved by evaluating all senses of th{~ 
ll(}llllS} }IS exi)e{:ted, since the sele{:l;iollal t)r{~.fer{,Jl{'e 
is modele{t l;hr{mgh senmnti(: feai;mes (}t" the glos- 
sary nouns, not just their word forms. The. r{;as{m 
for the good accuracy from using only the tirst noun 
sense is because 72% of them hal}pen to be the first; 
S0,IlSe. r\]_~heso results are very ('JH;ouragillg si\]lco 11o 
tagged corl)us and minimal training data were. used. 
\¥e believe that with a bigger training set, \]3HD's 
t)erf(}rnlance will iml}rove even further. 
4.6 Colnl)arison with Other Mo{h,Js 
rE() our kn{}wledge, there are {rely two oLher sysl;enls 
that (lisanfl)iguate a.(lj{~{:tive-noml llairs from unre- 
st.ri{:l;e{l l;ex{;. Results fl'om both models were evalu- 
ated against SemCor and thus a comparison is mean- 
ingful. In Table 3, each model's accuracy (as well as 
\'Vithout 
lh'ior 
With 
1}l.i{}r 
Context 
11Ol111 only 
+SP 
11Ol111 ollly 
+SP 
1. st IIOllll 
SOIISO 
56.3% 
6O.O% 
77.8% 
80.{}% 
all noun 
S{BIIH(~S 
53.3% 
60.0% 
77.8% 
81.4% 
Baseline 75.6% 75.6% 
Ta.1}le 2: Accuracy results from the seleetional pref- 
erence model (+SP), showing the improvements over 
the baseline by either considering the first llOllIl 
sense or all noun senses. 
Model Results \[Baseline 
HIII) 81.4% 75.6% 
Mihah:ea and Moldovan 79.8% 81.8% 
(1999) 
SW.tina el; al. (1998) 83.6% 81.9% 
Table 3: Colnt/arison of atljectiv(~ disalnl}iguation at:- 
curay with other inodels. 
lhe baseline) is provided since different adjet:tive- 
noun pairs were e, valuated. We find t;he BIt\]) re- 
suits conlpa.ralfle, if not bel;ter, espcc, ially when ihe 
}IlllOllll{; Of inq)rovenw, nl; ()vet th0, \])aseline is eonsi(l- 
ered. The Ino<lel 1) 3, ,SI;e,l;ina (1.998) was {,rai\]md tm 
SemCor that was merged with a flfll senl;ential parse 
tree, the determination of which is considered a dif- 
ficult l)rolflem of its own (Collins, 1997). \Ve belie, re 
thai; by int:ori}oral;ing tim data from SemC(n (dis- 
(:llSse(l ill I;he fllI;llre work sc, ct;ion)~ {;11o, \]}erforlll}lllCe 
(ff our sysi;em will surpass Stetina's. 
5 Conclusion and Future Work 
We have 1)resenl;etl a t)rol}~fl}ilistic disanfl}iguation 
model that ix sysl;e, ln~l;i{,, a(tcllral;e, all(l require llHlll- 
ual intervention ill only ill two places. The more 
l;illle (:OllSlllllill~ of tim l;wo manual l;asks is to (:\]as- 
sitS' th(~ toll 100 nouns needed for the priors. The 
el;her task~ of disanfl)igual;ing l)rol;olTpical llOllllS, is 
relal;ively simple due to the limited nunfl)er of glos- 
sary nouns per sense. IIowever, it would l}e straight- 
forward to incorporate semantically tagged corpora, 
such as SemCor, to avoid these mamml tasks. The 
priors are the number of instances of each adjective 
sense divided by all of the adjectives in the corpus. 
The disambiguated adjectiveT~i-noun#.\] pairs from 
the corpus ean be used as training sets to build bet- 
{e,r ret/resental;ion of selectional preferences l}y in- 
serting tim nounT~j node mid the ac(:omf}any featm'es 
into the l}elief network of a{ljectivegfii. The insertion 
is the same prot:c/hue used to add the hyllothe|;ical 
evidence dm'ing the inferoncing stage. The Ul)(lated 
belief networks could then be, used tbr disambigua- 
lion wii;h improve.d at:curacy. Furthernlore, the per- 
formance of BIID (:(mid a.lso be improved by exl)and- 
157 
ing the context or using statistical learning methods 
such as the EM algorithln (DemI)ster et al., 1977). 
Using Bayesian networks gives the model ttexibility 
to incorporate additional contexts, such as syntac- 
tical and morphological features, without incurring 
exorbitant costs. 
It is l)ossible that, with an extended model that 
accurately disambiguates adjective-noun pairs, the 
selectional preference of adjective senses coutd be au- 
tomatically learned. Having all improved knowledge 
al)out the selectional 1)references would then provide 
better parameters for disanfl)iguation. The model 
can be seen as a bootstrapping learning process tbr 
disambiguation, where the information gained from 
one part (selectional preference) is used to improve 
tile other (disambiguation) and vice versa, reminis- 
cent of the work by Riloff and Jones (1.999) and 
Yarowsky (1995). 
Lastly, the techniques used in this paper could be 
scaled to disambiguate not only all adjective-noun 
pairs, but also other word pairs, such as subject- 
verb, verb-object, adverb-verb, by obtaining most of 
the paraineters from the Internet and WordNet. If 
the information fi'oln SemCor is also used, then the 
system could be automatically trained to pertbrm 
disambiguation tasks on all content words within a 
SellteI1Ce. 
In this paper, we have addressed three of what we 
believe to be the main issues timed 1)y current WSD 
systems. We demonstrated the effectiveness of the 
teclmiques used, while identii~ying two mmmal tasks 
that don't necessarily require a semantically tagged 
corpus. By establishing accurate priors a.nd small 
training sets, our system achieved good initial dis- 
ambiguation accuracy. The salne methods could 1)e 
flflly automated to disami)iguate all content word 
pairs if infbrmation from semantically tagged cor- 
pora is used. Our goal is to create a system that can 
disambiguate all content words to an accuracy level 
sufficient for automatic tagging with tummn valida- 
tion, which could then be used to improve or fa- 
cilitate new probabilistic semantic taggers accurate 
enough for other NLP applications. 

References 

Eneko Agirre and German Rigau. 1996. Word sense 
dismnbiguation using conceptual density. In Proceedings of COLING-96, Copenhagen. 

Michael Collins. 1997. Three generative, lexicalised 
models for statistical parsing. In Proceedings of 
the 351h Annual Meeting of the ACL, pages 16-23, Madrid, Spain. 

A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. 
Maximum likelihood from incomplete data via the 
EM algorithm. Journal of the Royal Statistical 
Society, 39 (B): 1-38. 

Sadao Kurohashi Jiri Stetina and Makoto Nagao. 
1998. General word sense disambiguation method 
based on a full sentential context. In Proceedings 
of COLING-ACL Workshop on Usage of WordNet in Natural Language Processing, Montreal, 
Canada, .July. 

Rada Mihalcea and Dan Moldovan. 1998. Word 
sense disambiguation based on semantic density. 
In Proceedings of COLING-ACL Workshop on Usage of WordNet in Natural Language Proecssing, 
Montreal, Canada, July. 

G. Miller, C. Leacock, and R. Tengi. 1993. A semantic concordance. In Procccdings of ARPA Human 
Language Technology, Princeton. 

G. Miller. 1990. WordNet: An on-line lexical 
database. International Journal of Lexicography, 
3(4:). 

Hwee Tou Ng and Hian Beng Lee. 1996. Integrating 
multiple knowledge sources to disambiguate word 
sense: An exemplar-based approach In Proceedings of the 3/tth, Annual Meeting of ACL, Santa 
Cruz, June. 

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 
Morgan Kaufmalm, San Mateo, CA. 

Philip Resnik and David Yarowsky. 1997. A perspective on word sense disambiguation methods 
and their evaluation. In ANLP Workshop on Tagging Text with, Lexical Semantics, Washington, 
D.C., June. 

Philip Resnik. 1997. Selectional preference and 
sense disambiguation. In ANLP Workshop on 
Tagging Text with, Lcxical Semantics, Washington, D.C., June. 

Ellen Riloff and Rosie Jones. 1999. Learning dic- 
tionaries fbr information extraction by multi-level 
bootstrapping. In Proceedings of AAAI-O9, Of 
lando, Florida. 

Janyce Wiebe, Tom O'Hara, and Rebecca Bruce. 
1998. Constructing bayesian networks from WordNet for word-sense disambiguation: Representational and Processing issues. In Proceedings of 
COLING-ACL Workshop on Usage of WordNet in 
Natural Language Processing, Montreal, Canada, 
July. 

David Yarowsky. 1992. Word-sense disambiguation using statistical model of Roget's categories trained on large corpora. In Proceedings of 
COLING-92, Nantes, France. 

David Yarowsky. 1995. Unsupervised word sense 
disambiguation rivaling supervised methods. In 
Proceedings of the 33rd Annual Meeting of the 
ACL. 
