General Word Sense Disambiguation Method 
Based on a Full Sentential Context 
Jiri Stetina Sadao Kurohashi Makoto Nagao 
Graduate School of Informatics, Kyoto University
Yoshida-honmachi, Sakyo, Kyoto, 606-8501, Japan
{stetina, kuro, nagao}@kuee.kyoto-u.ac.jp
Abstract 
This paper presents a new general supervised word 
sense disambiguation method based on a relatively 
small syntactically parsed and semantically tagged 
training corpus. The method exploits a full senten- 
tial context and all the explicit semantic relations in 
a sentence to identify the senses of all of that sen- 
tence's content words. In spite of a very small train- 
ing corpus, we report an overall accuracy of 80.3% 
(85.7, 63.9, 83.6 and 86.5%, for nouns, verbs, adjec- 
tives and adverbs, respectively), which exceeds the 
accuracy of a statistical sense-frequency based se- 
mantic tagging, the only really applicable general 
disambiguating technique. 
1 Introduction 
Identification of the right sense of a word in a sen- 
tence is crucial to any successful Natural Language 
Processing system. The same word can have dif- 
ferent meanings in different contexts. The task of 
Word Sense Disambiguation is to determine the cor- 
rect sense of a word in a given context. 
In most cases the correct word sense can be iden- 
tified using only the words co-occurring in the same 
sentence. However, very often we also need to use 
the context of words that appear outside the given 
sentence. For this reason we distinguish two types 
of contexts: the sentential context and the discourse 
context. The sentential context is given by the words 
which co-occur with the word in a sentence and by 
their relations to this word, while the discourse con- 
text is given by the words outside the sentence and 
their relations to the word. The problem that arises 
here is that most of the co-occurring words are also 
polysemous, and unless disambiguated they cannot 
fully contribute to the process of disambiguation. 
The senses of these words, however, also depend on 
the sense of the disambiguated word and therefore 
there is a reciprocal dependency which we will try 
to resolve by the algorithm described in this paper. 
Table 1: Percentage of nouns, verbs, adjectives and
adverbs and average number of senses

Category     Number    %      Average # of senses
NOUNS        48,534    45.5   5.4
VERBS        26,674    25.0   10.5
ADJECTIVES   19,743    18.5   5.5
ADVERBS      11,804    11.0   3.?
TOTAL        106,755   100.0  5.8
2 The Task Specification 
For our work, we used the word sense definitions 
as given in WordNet (Miller, 1990), which is com- 
parable to a good printed dictionary in its cover- 
age and distinction of senses. Since WordNet only 
provides definitions for content words (nouns, verbs, 
adjectives and adverbs), we are only concerned with 
identifying the correct senses of the content words. 
Both for the training and for the testing of our 
algorithm, we used the syntactically analysed sen- 
tences of the Brown Corpus (Marcus, 1993), which 
have been manually semantically tagged (Miller et 
al., 1993) into semantic concordance files (SemCor). 
These files combine 103 passages of the Brown Cor- 
pus with the WordNet lexical database in such a way 
that every content word in the text carries both a 
syntactic tag and a semantic tag pointing to the ap- 
propriate sense of that word in WordNet. Passages 
in the Brown Corpus are approximately 2,000 words 
long, and each contains approximately 1,000 content 
words. 
The percentages of the nouns, verbs, adjectives
and adverbs in the semantically tagged corpus,
together with their average number of WordNet senses,
are given in Table 1. Although most of the words
in a dictionary are monosemous, it is the polysemous
words that occur most frequently in speech
and text. For example, over 80% of words in WordNet
are monosemous, but almost 78% of the content
words in the tested corpus had more than one sense,
as shown in Table 2.
Table 2: Percentage of polysemous words in the corpus

Category     Number    Polysemous  %
NOUNS        48,534    38,279      78.9
VERBS        26,674    24,845      93.1
ADJECTIVES   19,743    13,315      67.4
ADVERBS      11,804    6,715       56.9
TOTAL        106,755   83,154      77.9
Assigning the most frequent sense (as defined by
WordNet) to every content word in the used corpus
would result in an accuracy of 75.2%. Our aim is
to create a word sense disambiguation system for
identifying the correct senses of all content words
in a given sentence, with an accuracy higher than
would be achieved solely by using the most
frequent sense.
3 General Word Sense 
Disambiguation 
The aim of the system described here is to take any
syntactically analysed sentence as input and assign
each of its content words a pointer to an appropriate
sense in WordNet. Because the words in a sentence
are bound by their syntactic relations, the senses of
all the words are determined by their most probable
combination in all the syntactic relations derived
from the parse structure of the given sentence. It is
assumed here that each phrase has one central
constituent (head), and all other constituents in the
phrase modify the head (modifiers). It is also assumed
that there is no relation between the modifiers. The
relations are explicitly present in the parse tree,
where head words propagate up through the tree, each
parent receiving its head word from its head-child.
Every syntactic relation can also be viewed as a
semantic relationship between the concepts represented
by the participating words. Consider, for example,
sentence (1), whose syntactic structure is given in
Figure 1.
(1) The Fulton County Grand Jury said 
Friday an investigation of Atlanta's recent 
primary election produced no evidence that 
any irregularities took place. 
Each word in the above sentence is bound by a 
number of syntactic relations which determine the 
correct sense of the word. For example, the sense 
of the verb produced is constrained by the subject- 
verb relation with the noun investigation, by the 
verb-object relation with the noun evidence and by 
the subordinate clause relation with the verb said. 
Similarly, the verb said is constrained by its rela- 
tions with the words Jury, Friday and produced; the 
sense of the noun investigation is constrained by the 
relation with the head of its prepositional phrase - 
election, and by the subject-verb relation with the 
verb produced, and so on. 
The key to extraction of the relations is that any
phrase can be substituted by the corresponding tree
head-word (links marked bold in Figure 1). To
determine the tree head-word we used a set of rules
similar to that described by (Magerman, 1995) and
(Jelinek et al., 1994) and also used by (Collins, 1996),
which we modified in the following way:
• The head of a prepositional phrase (PP -> IN NP)
was substituted by a function whose name
corresponds to the preposition and whose sole
argument corresponds to the head of the noun
phrase NP.
• The head of a subordinate clause was changed
to a function named after the head of the first
element in the subordinate clause (usually 'that'
or a 'NULL' element), whose sole argument
corresponds to the head of its second element
(usually the head of a sentence).
Because we assumed that the relations within the
same phrase are independent, all the relations are
between the modifier constituents and the head of
a phrase only. This is not necessarily true in some
situations, but for the sake of simplicity we took the
liberty of assuming so. A complete list of applicable
relations for sentence (1) is given in (2).
(2) NP(NNP(County),NNP(Jury))
NP(NNP(Grand),NNP(Jury))
NP(NP(Atlanta),NP(election))
NP(JJ(recent),NP(election))
NP(JJ(primary),NN(election))
NP(NN(investigation),PP(of(election)))
S(NP(irregularities),VP(took))
VP(VBD(took),NP(place))
NP(NN(evidence),SBAR(that(took)))
S(NP(investigation),VP(produced))
VP(VBD(produced),NP(evidence))
VP(VBD(said),NP(Friday))
VP(VBD(said),SBAR(0(produced)))
S(NP(Jury),VP(said))
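The extraction of head-modifier relations such as those in (2) can be illustrated with a minimal sketch. It assumes a hypothetical tree encoding in which every phrase carries a hand-marked head child; the head rules themselves are not reproduced here, only the prepositional-phrase substitution is handled, and the names `head_of`, `is_content` and `relations` are illustrative rather than taken from the paper.

```python
from typing import List

# Assumed encoding: a leaf is (tag, word); a phrase is (label, head_index, children).
# head_index is marked by hand here; the paper derives heads from head rules.
CONTENT_TAGS = {"NN", "NNP", "JJ", "VBD", "RB"}  # illustrative subset of tags


def is_leaf(tree) -> bool:
    return len(tree) == 2


def head_of(tree) -> str:
    """Return the head word of a subtree; a PP head becomes prep(noun)."""
    if is_leaf(tree):
        return tree[1]
    label, head_index, children = tree
    if label == "PP":
        # PP -> IN NP: a function named after the preposition, applied to the NP head.
        return f"{head_of(children[0])}({head_of(children[1])})"
    return head_of(children[head_index])


def is_content(tree) -> bool:
    """True if the subtree's head (or the PP's noun argument) is a content word."""
    if is_leaf(tree):
        return tree[0] in CONTENT_TAGS
    label, head_index, children = tree
    if label == "PP":
        return is_content(children[1])
    return is_content(children[head_index])


def relations(tree) -> List[str]:
    """Collect one relation per content modifier of every phrase, in surface order."""
    if is_leaf(tree):
        return []
    label, head_index, children = tree
    found: List[str] = []
    for child in children:
        found.extend(relations(child))
    if label == "PP":
        return found  # the PP itself is folded into its functional head
    head = children[head_index]
    for i, child in enumerate(children):
        if i != head_index and is_content(child):
            pair = (child, head) if i < head_index else (head, child)
            args = ",".join(f"{c[0]}({head_of(c)})" for c in pair)
            found.append(f"{label}({args})")
    return found


tree = ("NP", 0, [
    ("NN", "investigation"),
    ("PP", 0, [
        ("IN", "of"),
        ("NP", 1, [("JJ", "recent"), ("NN", "election")]),
    ]),
])
print(relations(tree))
```

For this fragment the sketch yields the adjective-noun relation inside the prepositional phrase and the noun-PP relation with the functional head of(election), mirroring the corresponding entries of (2).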
Each of the extracted syntactic relations has a cer- 
tain probability for each combination of the senses 
of its arguments. This probability is derived from 
the probability of the semantic relation of each com- 
bination of the sense candidates of the related con- 
tent words. Therefore, the approach described here 
consists of two phases: 1. learning the semantic re- 
lations, and 2. disambiguation through the proba- 
bility evaluation of relations. 
4 Learning 
First, every content word in every sentence in the
training set was tagged with an appropriate pointer to
a sense in WordNet.
Figure 1: Example parse tree

Secondly, using the parse trees of all the corpus
sentences, all the syntactic relations present in the
training corpus were extracted and converted into
the following form:
(4) rel(PNT, MNT, HNT, MS, HS, RP).
where PNT is the phrase parent non-terminal, MNT
the modifier non-terminal, HNT the head non-terminal,
MS the semantic content (see below) of
the modifier constituent, HS the semantic content
of the head constituent and RP the relative position
of the modifier and the head (RP=1 indicates
that the modifier precedes the head, while for RP=2
the head precedes the modifier). Relations involving
non-content modifiers were ignored. Synsets of the
words not present in WordNet were substituted by
the words themselves.
The semantic content was either a WordNet sense
identifier (synset) or, in the case of prepositional
and subordinate phrases, a function of the preposition
(or a null element) and the sense identifier
of the second phrase constituent.
5 Disambiguation Algorithm 
As mentioned above, we assumed that all the content
words in a sentence are bound by a number of
syntactic relations. Every content word can have
several meanings, but each of these meanings has a
different probability, which is given by the set of
semantic relations in which the word participates.
Because every relation has two arguments (a head and its
modifier), the probability of each sense also depends
on the probability of the sense of the other participant
in the relation. The task is to select a
combination of senses for all the content words such that
the overall relational probability is maximal. If, for
any given sentence, we had extracted N syntactic
relations R_i, the overall relational probability for the
combination of senses X would be:
(5) ORP(X) = \prod_{i=1}^{N} p(R_i \mid X)
where p(R_i | X) is the probability of the i-th relation
given the combination of senses X. If we consider
that the average word sense ambiguity in the used
corpus is 5.8 senses, a sentence with 10 content words
would have 5.8^10 possible sense combinations, leading
to a combinatorial explosion of over 43,080,420
overall probability combinations, which is not feasible.
Also, with a very small training corpus, it is
not possible to estimate the sense probabilities very
accurately. Therefore, we have opted for a hierarchical
disambiguation approach based on similarity
measures between the tested and the training relations,
which we will describe in Section 5.2. First,
however, we will describe the part of the probabilistic
model which assigns probability estimates to the
individual sense combinations based on the semantic
relations acquired in the learning phase.
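The size of the search space quoted above can be checked with a few lines of arithmetic. This is only a worked illustration of the figure in the text; the variable names are hypothetical.

```python
# Average number of senses per content word in the corpus (Table 1, TOTAL row).
avg_senses = 5.8
# Number of content words in the example sentence.
content_words = 10

# Every word independently takes one of its senses, so the counts multiply.
combinations = avg_senses ** content_words
print(f"{combinations:,.2f} sense combinations")
```

Each additional content word multiplies the number of combinations by a further factor of 5.8 on average, which is why exhaustive evaluation of (5) is avoided.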
5.1 Relational Probability Estimate 
Consider, for example, the syntactic relation between
a head noun and its adjectival modifier derived
from NP -> JJ NN. Let us assume that the
number of senses in WordNet is k for the adjective
and l for the noun. The number of possible sense
combinations is therefore m = k · l. The probability
estimate of a sense combination (i,j) in the relation
R, where i is the sense of the modifier (adjective in
this example) and j is the sense of the head (noun
in this example), is calculated as follows:
(6) p_R(i,j) = \frac{f_R(i,j)}{\sum_{o=1}^{k} \sum_{p=1}^{l} f_R(o,p)}
where f_R(i,j) is a score of co-occurrences of a modifier
sense i with a head word sense j, among
the same semantic relations R extracted during the
learning phase. Please note that, because f_R(i,j) is
not a count but rather a score of co-occurrences
(defined below), p_R(i,j) is not a real probability but
rather its approximation. Because the occurrence
count is replaced by a similarity score, the sparse
data problem of a small training corpus is substantially
reduced. The score of co-occurrences is defined
as a sum of hits of similar pairs, where a hit is
a multiplication of the similarity measures, sim(i,x)
and sim(j,y), between both participants, i.e.:
(7) f_R(i,j) = \sum_{(x,y) \in R} sim(i,x) \cdot sim(j,y)
where the sum runs over the r pairs (x,y) of relations
of the same type (for the above example
R = rel(NP,ADJ,NOUN,x,y,1)) found in the training
corpus. To emphasise the sense-restricting contribution
of each example found, every pair (x,y) is
restricted to contributing to only one sense combination
(i,j): every example pair (x,y) contributes only
to the combination for which sim(i,x) * sim(j,y)
is maximal.
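The scoring in (6) and (7), including the restriction that each training pair contributes only to its best sense combination, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the similarity values are those of the modern building example in Figure 2, and every similarity not listed there is assumed to be 0.

```python
from collections import defaultdict


def relational_scores(mod_senses, head_senses, training_pairs, sim_mod, sim_head):
    """f_R(i, j) as in (7): each training pair adds its hit to its best (i, j) only."""
    scores = defaultdict(float)
    for x, y in training_pairs:
        hits = {(i, j): sim_mod(i, x) * sim_head(j, y)
                for i in mod_senses for j in head_senses}
        best = max(hits, key=hits.get)
        if hits[best] > 0.0:  # pairs from unrelated semantic classes add nothing
            scores[best] += hits[best]
    return dict(scores)


def relational_probabilities(scores):
    """p_R(i, j) as in (6): each score divided by the sum of all scores."""
    total = sum(scores.values())
    return {pair: value / total for pair, value in scores.items()} if total else {}


# Similarities read off the worked example; unlisted pairs are assumed to be 0.
MOD_SIM = {("MODERN(5)", "NEW(9)"): 1.0, ("MODERN(3)", "CLASSICAL(1)"): 0.75}
HEAD_SIM = {("BUILDING(1)", "AIRPORT"): 0.8, ("BUILDING(2)", "MUSIC(3)"): 0.7}

f_R = relational_scores(
    mod_senses=[f"MODERN({n})" for n in range(1, 6)],
    head_senses=[f"BUILDING({n})" for n in range(1, 4)],
    training_pairs=[("NEW(9)", "AIRPORT"), ("CLASSICAL(1)", "MUSIC(3)")],
    sim_mod=lambda i, x: MOD_SIM.get((i, x), 0.0),
    sim_head=lambda j, y: HEAD_SIM.get((j, y), 0.0),
)
p_R = relational_probabilities(f_R)
for pair, value in sorted(p_R.items()):
    print(pair, round(value, 2))
```

With these inputs only two sense combinations receive a non-zero estimate, and after rounding to one decimal they agree with the 0.6 and 0.4 entries of the relational matrix in Figure 2.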
f_R(i,j) represents a sum of all hits in the training
corpus for the sense combination (i,j). Because
the similarity measure (see below) has a value between
0 and 1 and each hit is a multiplication of
two similarities, its value is also between 0 and 1.
The reason why we used a multiplication of similarities
was to eliminate the contributions of examples
in which one participant belonged to a completely
different semantic class. For example, the
training pair new airport makes no contribution to
the probability estimate of any sense combination of
new management, because neither of the two senses of
the noun management (group or human activity) belongs
to the same semantic class as airport (entity).
On the other hand, new airport would contribute to
the probability estimate of the sense combination of
modern building, because one sense of the adjective
modern is synonymous with one sense of the adjective
new, and one sense of the noun building belongs
to the same conceptual class (entity) as the noun
airport. The situation is analogous for all other
relations. The reason why we used a count modified
by the semantic distances, rather than a count of
exact matches only, was to avoid situations where
no match would be found due to sparse data, a
problem of many small training corpora.
Every semantic relation can be represented by a
relational matrix, which is a matrix whose first
coordinate represents the sense of the modifier, the
second coordinate represents the sense of the head
and the value at the coordinate position (i,j) is the
estimate of the probability of the sense combination
(i,j) computed by (6). An example of a relational
matrix for an adjective-noun relation modern building
based on two training examples (new airport and
classical music) is given in Figure 2. Naturally, the
more examples there are, the more fields of the matrix get
filled. The training examples have a cumulative
effect on the matrix, because the sense probabilities
in the matrix are calculated as a sum of 'similarity
based frequency scores' of all examples (7) divided
by the sum of all matrix entries (6). The most likely
sense combination scores the highest value in the
matrix. Each semantic relation has its own matrix.
The way all the relations are combined is described
in Section 5.2.
5.1.1 Semantic Similarity 
We base the definition of the semantic similarity
between two concepts (concepts are defined by their
WordNet synsets a, b) on their semantic distance, as
follows:
(8) sim(a,b) = 1 - sd(a,b)^2
The semantic distance sd(a,b) is squared in the
above formula in order to give a bigger weight to
closer matches.
The semantic distance is calculated as follows.
Semantic Distance for Nouns and Verbs
sd(a,b) = \frac{1}{2} \left( \frac{D_1 - D}{D_1} + \frac{D_2 - D}{D_2} \right)
where D_1 is the depth of synset a, D_2 is the depth
of synset b, and D is the depth of their nearest
common ancestor in the WordNet hierarchy. If a
and b have no common ancestor, sd(a,b) = 1.
If any of the participants in the semantic distance
calculation is a function (derived from a prepositional
phrase or subordinate clause), the distance is
equal to the distance of the function arguments for
the same functor, or equals 1 for different functors.
For example, sd(of(sense1), of(sense2)) = sd(sense1,
sense2), while sd(of(sense1), about(sense2)) = 1, no
matter what sense1 and sense2 are.
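The distance for nouns and verbs, its functional variant and the similarity (8) can be sketched as below. This is a minimal illustration that takes synset depths as plain integers instead of querying WordNet; the function names are hypothetical, and the example depths (6, 5 and a common ancestor at depth 3) are those of the AIRPORT and BUILDING(1) computation in Figure 2.

```python
from typing import Optional


def semantic_distance(depth_a: int, depth_b: int,
                      depth_common: Optional[int]) -> float:
    """sd(a,b) = 1/2 * ((D1 - D)/D1 + (D2 - D)/D2); 1 if there is no common ancestor."""
    if depth_common is None:
        return 1.0
    return 0.5 * ((depth_a - depth_common) / depth_a
                  + (depth_b - depth_common) / depth_b)


def functional_distance(functor_a: str, functor_b: str,
                        argument_distance: float) -> float:
    """Same functor: distance of the arguments; different functors: 1."""
    return argument_distance if functor_a == functor_b else 1.0


def similarity(distance: float) -> float:
    """sim(a,b) = 1 - sd(a,b)^2, as in (8)."""
    return 1.0 - distance ** 2


sd_airport_building = semantic_distance(6, 5, 3)
print(round(sd_airport_building, 2), round(similarity(sd_airport_building), 2))
print(functional_distance("of", "about", sd_airport_building))
```

Squaring the distance keeps the similarity of close matches near 1 while still letting it fall to 0 for concepts with no common ancestor.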
Semantic Distance for Adjectives
sd(a,b) = 0 for the same adjectival synsets
(incl. synonymy),
sd(a,b) = 0 for the synsets in antonymy relations,
i.e. for ant(a,b),
sd(a,b) = 0.5 for the synsets in the same similarity
cluster,
sd(a,b) = 0.5 if a belongs to the same similarity
cluster as c and b is the antonym of c (indirect
antonymy),
sd(a,b) = 1 for all other synsets.
WordNet hierarchy fragment: ENTITY > PHYSICAL OBJECT > ARTIFACT;
ARTIFACT > FACILITY > AIRFIELD > AIRPORT;
ARTIFACT > STRUCTURE > BUILDING(1)

Example to disambiguate: MODERN(X) BUILDING(Y): rel(NP,ADJ,NOUN,X,Y,1)
Example training set:
NEW(9) AIRPORT: rel(NP,ADJ,NOUN,3006112602,102055456,1)
CLASSICAL(1) MUSIC(3): rel(NP,ADJ,NOUN,300306289,100313161,1)

sd(AIRPORT,BUILDING(1)) = 1/2(3/6 + 2/5) = 0.45
sd(AIRPORT,BUILDING(2)) = 1.0
sd(AIRPORT,BUILDING(3)) = 1.0
sd(BUILDING(2),BUILDING(3)) = 1/2(4/5 + 4/5) = 0.8
sd(BUILDING(2),MUSIC(3)) = 1/2(3/5 + 2/4) = 0.55
sim(AIRPORT,BUILDING(1)) = 1 - 0.45^2 = 0.8
sim(MUSIC(3),BUILDING(2)) = 1 - 0.55^2 = 0.7
sim(CLASSICAL(1),MODERN(3)) = 1 - 0.5^2 = 0.75
f_R(5,1) = sim(NEW(9),MODERN(5)) * sim(AIRPORT,BUILDING(1)) = 1.0 * 0.8 = 0.8
f_R(3,2) = sim(CLASSICAL(1),MODERN(3)) * sim(MUSIC(3),BUILDING(2)) = 0.75 * 0.7 = 0.53
p_R(3,2) = f_R(3,2)/sum(f_R(i,j)) = 0.53/(0.8 + 0.53) = 0.4
p_R(5,1) = f_R(5,1)/sum(f_R(i,j)) = 0.8/(0.8 + 0.53) = 0.6

Relational matrix:
              MODERN(1)  MODERN(2)  MODERN(3)  MODERN(4)  MODERN(5)
BUILDING(1)   0.0        0.0        0.0        0.0        0.6
BUILDING(2)   0.0        0.0        0.4        0.0        0.0
BUILDING(3)   0.0        0.0        0.0        0.0        0.0

Figure 2: Relational matrix based on two training examples
Semantic Distance for Adverbs
sd(a,b) = 0 for the same synsets (incl. synonymy),
sd(a,b) = 0 for the synsets in antonymy relation
ant(a,b),
sd(a,b) = 1 for all other synsets.
5.2 Hierarchical Disambiguation
This section describes the main part of the algorithm,
i.e. the disambiguation process based on
the overall probability estimate of sentential relations.
As we have outlined above, for computational
reasons, it is not feasible to evaluate overall
probabilities for all the sense combinations. Instead, we
take advantage of the hierarchical structure of each
sentence and arrive at the optimum combination of
its word senses, in a process which has two parts:
1. bottom-up propagation of the head word sense
scores and 2. top-down disambiguation.
5.2.1 Bottom-up head word sense score 
propagation 
In compliance with our assumption that all the
semantic relations are only between a head word
and its modifiers at any syntactic level, the modifiers
do not participate in any relation with an element
outside their parent phrase. As depicted in the
example in Figure 1, it is only the head word concepts
that propagate through the parse tree and that
participate in semantic relations with concepts on
other levels of the parse tree. The modifiers (which
are heads themselves at lower tree levels), however,
play an important role in constraining the head-word
senses. The number of relations derived at each level
of the tree depends on the number of concepts that
modify the head. Each of these relations contributes
to the score of each sense of the head word. We define
the sense score vector of a word w as a vector
of scores of each WordNet sense of the word w. The
initial sense score vector of the word w is given
by its contextually independent sense distribution
in the whole training corpus. Because the training
corpus is relatively small, and because it always
excludes the tested file, an appropriate sense of the
word w may not be present in it at all. Therefore,
each sense i of the word w is always given a non-zero
initial score p_i(w) (9a):
(9a) p_i(w) = \frac{count(w)_i + 1}{\sum_{j=1}^{n} (count(w)_j + 1)}
where count(w)_i is the number of occurrences of the
sense i of the word w in the entire training corpus,
and n is the number of different WordNet senses of
the word w.
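The smoothing in (9a) can be illustrated with a short sketch. The function name and the example counts are hypothetical; only the formula itself comes from the text.

```python
from typing import List


def initial_sense_scores(counts: List[int]) -> List[float]:
    """Initial sense score vector (9a); counts[i] is count(w)_i in the training corpus."""
    total = sum(count + 1 for count in counts)
    return [(count + 1) / total for count in counts]


# A word with three WordNet senses, the last of which never occurs in training.
scores = initial_sense_scores([7, 2, 0])
print([round(score, 3) for score in scores])
```

Adding one to every count keeps an unseen sense available for selection later, at the price of slightly flattening the observed distribution.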
The sense score vectors of head words propagate
up the tree. At each level, they are modified by
all the semantic relations with their modifiers which
occur at that level. Also, the sense score vectors of
head words are used to calculate the matrices of the
sense score vectors of the modifiers. This is done as
follows:
Let H = [h_1, h_2, ..., h_k] be the sense score vector
of the head word h. Let T = [R_1, R_2, ..., R_n] be a
set of relations between the head word h and its
modifiers.
1. For each semantic relation R_i ∈ T between the
head word h and a modifier m_i with sense score
vector M_i = [o_i1, o_i2, ..., o_il], do:
1.1 Using (6), calculate the relational matrix
R_i(m,h) of the relation R_i
1.2 For each o_iu ∈ M_i multiply all the elements
of R_i(m,h) for which m = u by o_iu,
yielding Q_i, the sense score matrix of
the modifier m_i
2. The new sense score vector of the head word h
is now G = [g_1, g_2, ..., g_k], where
(10) g_j = \frac{L_j}{L} \cdot h_j
L_j/L represents the score of the head word
sense j based on the matrices Q calculated in
step 1, i.e.:
(11) L_j = \sum_{i=1}^{n} \max_u x_i(j,u)
where x_i(j,u) ∈ Q_i and max(x_i(j,u)) is the
highest score in the line of the matrix Q_i which
corresponds to the head word sense j. n is the
number of modifiers of the head word h at the
current tree level, and
L = \sum_{j=1}^{k} L_j
where k is the number of senses of the head
word h.
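One level of the bottom-up propagation can be sketched as follows. This is a minimal illustration under stated assumptions: matrices are stored as lists of rows with one row per modifier sense and one column per head sense, the function names are hypothetical, and returning H unchanged when no relation scores any hit is a choice made here, not something the text specifies.

```python
from typing import List

Matrix = List[List[float]]  # assumed layout: rows = modifier senses, columns = head senses


def sense_score_matrix(relational: Matrix, modifier_scores: List[float]) -> Matrix:
    """Step 1.2: scale each modifier-sense row of R_i by that sense's score, giving Q_i."""
    return [[score * value for value in row]
            for score, row in zip(modifier_scores, relational)]


def propagate_head_scores(head_scores: List[float],
                          matrices: List[Matrix]) -> List[float]:
    """Step 2: g_j = (L_j / L) * h_j, with L_j as in (11)."""
    k = len(head_scores)
    # L_j: for each modifier take the best entry for head sense j, then sum over modifiers.
    best = [sum(max(row[j] for row in q) for q in matrices) for j in range(k)]
    total = sum(best)
    if total == 0.0:
        return list(head_scores)  # assumption: without any hit, H is left as it is
    return [best[j] / total * head_scores[j] for j in range(k)]


H = [0.5, 0.5]                      # head word with two senses
R1 = [[0.6, 0.0], [0.0, 0.4]]       # relational matrix of one modifier
Q1 = sense_score_matrix(R1, [0.5, 0.5])
G = propagate_head_scores(H, [Q1])
print(Q1, G)
```

Because L_j is a sum over modifiers rather than a product, a modifier with no matching training example lowers the score of a head sense without forcing it to zero.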
The reason why g_j (10) is calculated as a sum of
the best scores (11), rather than by using the traditional
maximum likelihood estimate (Berger et al.,
1996)(Gale et al., 1993), is to minimise the effect of
the sparse data problem. Imagine, for example, the
phrase VP -> VB NP PP, where the head verb VB
is in the object relation with the head of the noun
phrase NP and also in the modifying relation with
the head of the prepositional phrase PP. Let us also
assume that the correct sense of the verb VB is a.
Even if the verb-object relation provided a strong
selectional support for the sense a, if there was no
example in the training set for the second relation
(between VB and PP) which would score a hit for the
sense a, multiplying the scores of that sense derived
from the first and from the second relation respectively
would yield a zero probability for this sense
and thus prevent its correct assignment.
The newly created head word sense score vector G
propagates upwards in the parse tree and the same
process repeats at the next syntactic level. Note
that at the higher level, depending on the head
extraction rules described in Section 3, the roles may
be changed and the former head word may become a
modifier of a new head (and participate in the above
calculation as a modifier). The process repeats itself
until the root of the tree is reached. The word sense
score vector which has reached the root represents a
vector of scores of the senses of the main head word
of the sentence (the verb said in the example in Figure
1), which is based on the whole syntactic structure
of that sentence. The sense with the highest score is
selected and the sentence head is disambiguated.
5.2.2 Top-down Disambiguation 
Having ascertained the sense of the sentence head, 
the process of top-down disambiguation begins. The 
top-down disambiguation algorithm, which starts 
with the sentence head, can be described recursively 
as follows: 
Let l be the sense of the head word h on the input.
Let M = [m_1, m_2, ..., m_x] be the set of the
modifiers of the head word h. For every modifier
m_i ∈ M, do:
1. In the sense score matrix Q_i of the modifier m_i
(calculated in step 1.2 of the bottom-up phase)
find all the elements x(k_i,l), where l is the sense
of the head h
2. Assign the modifier m_i the sense k = k' for
which the value x(k_i,l) is maximum. In the
case of a draw, choose the sense which is listed
as more frequent in WordNet.
3. If the modifier m_i has descendants in the parse
tree, call the same algorithm again with m_i being
the head and k being its sense, else end.
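The recursion above can be sketched as follows. This is a minimal illustration with hypothetical names: the `Phrase` class is an assumed container for a head word, its modifiers and their sense score matrices Q_i, senses are 0-indexed, and sense indices are assumed to follow WordNet's frequency order so that the lowest index wins a draw.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Phrase:
    """Hypothetical node: a head word, its modifiers and one Q matrix per modifier."""
    word: str
    modifiers: List["Phrase"] = field(default_factory=list)
    # matrices[i][k][l]: score of sense k of modifier i given sense l of this head
    matrices: List[List[List[float]]] = field(default_factory=list)


def disambiguate(head: Phrase, head_sense: int,
                 result: Dict[str, int]) -> Dict[str, int]:
    """Fix the head's sense, then pick each modifier's best sense and recurse."""
    result[head.word] = head_sense
    for modifier, q in zip(head.modifiers, head.matrices):
        column = [row[head_sense] for row in q]  # entries x(k, l) for the fixed l
        # On a draw prefer the lower index, assumed to be the more frequent sense.
        best = max(range(len(column)), key=lambda k: (column[k], -k))
        disambiguate(modifier, best, result)
    return result


evidence = Phrase("evidence")
produced = Phrase("produced", [evidence], [[[0.1, 0.4], [0.3, 0.2]]])
said = Phrase("said", [produced], [[[0.2, 0.1], [0.5, 0.5], [0.5, 0.0]]])
senses = disambiguate(said, 0, {})
print(senses)
```

Keying the result by word is a simplification for the example; a sentence that repeats a word would need positions as keys instead.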
The disambiguation of the modifiers (which become
heads at lower levels of the parse tree) is based
solely on those lines of their sense score matrices
which correspond to the sense of the head they are
in relation with. This is possible because of our
assumption that the modifiers are related only to their
head words, and that there is no relation among the
modifiers themselves. To what extent this assumption
holds in real life sentences, however, has yet to
be investigated.

Table 3: Number of words with the same and different sense as its previous occurrence in the same discourse
(shortened)

            Has predecessor with the same sense   Has predecessor with a different sense
Distance    NOUNS    VERBS   ADJs    ADVs         NOUNS   VERBS   ADJs   ADVs
anywhere    15,373   6,923   5,523   3,812        2,057   5,227   933    830
<10         9,474    3,697   2,733   1,672        649     2,521   258    214
<5          6,892    2,426   1,834   1,000        355     1,561   104    135
<4          5,964    2,065   1,566   841          290     1,269   104    82
<3          4,797    1,578   1,219   614          208     929     83     55
<2          3,039    986     733     348          103     555     42     27
6 Discourse Context 
(Yarowsky, 1995) pointed out that the sense of a tar- 
get word is highly consistent within any given doc- 
ument (one sense per discourse). Because our al- 
gorithm does not consider the context given by the 
preceding sentences, we have conducted the follow- 
ing experiment to see to what extent the discourse 
context could improve the performance of the word- 
sense disambiguation: 
Using the semantic concordance files (Miller et al.,
1993), we have counted the occurrences of content
words which previously appear in the same discourse
file. The experiment indicated that the "one sense
per discourse" hypothesis works fairly well for nouns;
however, the evidence is much weaker for verbs,
adverbs and adjectives. Table 3 shows the numbers of
content words which appear previously in the same
discourse with the same meaning (same synset), and
those which appear previously with a different meaning.
The experiment also confirmed our expectation
that the ratio of words with the same sense to those
with a different sense depends on the distance of
sentences in which the same words appear (distance
1 indicates that the same word appeared in the previous
sentence, distance 2 that the same word was
present 2 sentences before, etc.).
We have modified the disambiguation algorithm
to make use of the information gained by the above
experiment in the following way: All the disambiguated
words and their senses are stored. The
words of all the input sentences are first compared
with the set of these stored word-sense pairs. If the
same word is found in the set, the initial sense score
assigned to it by (9a) is modified using Table 3, so
that the sense which has been previously assigned
to the word gets higher priority. The calculation of
the initial sense score (9a) is thus replaced by (9b):
(9b) p_i(w) = \frac{count(w)_i + 1}{\sum_{j=1}^{n} (count(w)_j + 1)} \cdot e(POS, SN)
where e(POS,SN) is the probability that the word
with syntactic category POS which already occurred
SN sentences before has the same sense as its previous
occurrence. If, for example, the same noun has
occurred in the previous sentence (SN=1) where it
was assigned sense n, the probability of sense n of
the same noun in the current sentence is multiplied
by e(NOUN,1) = 3,039/(3,039+103) = 0.967, while all
the probabilities of its remaining senses are multiplied
by 1-0.967 = 0.033. If no match is found, i.e. the
word has not previously occurred in the discourse,
e(POS,SN) is set to 1 for all senses.

Table 4: Result Accuracy [%]

CONTEXT      NOUNS  VERBS  ADJs  ADVs  TOTAL
First sense  77.8   61.7   81.9  84.5  75.2
Sentence     84.2   63.6   82.9  86.3  79.4
+Discourse   85.7   63.9   83.6  86.5  80.3
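The discourse reweighting of (9b) can be sketched as follows. This is a minimal illustration with hypothetical function names; the two counts are the noun figures for the previous sentence from Table 3, and the example score vector is invented for demonstration.

```python
from typing import List, Optional


def discourse_factor(same: int, different: int) -> float:
    """e(POS, SN): share of repeated words that kept the sense of their predecessor."""
    return same / (same + different)


def apply_discourse(scores: List[float], previous_sense: Optional[int],
                    e: float) -> List[float]:
    """Multiply the previously assigned sense by e and every other sense by 1 - e."""
    if previous_sense is None:
        return list(scores)  # word not seen before: the factor is 1 for all senses
    return [score * (e if i == previous_sense else 1.0 - e)
            for i, score in enumerate(scores)]


e_noun_1 = discourse_factor(3039, 103)           # noun repeated from the previous sentence
adjusted = apply_discourse([0.5, 0.3, 0.2], 1, e_noun_1)
print(round(e_noun_1, 3), [round(score, 4) for score in adjusted])
```

In this invented example the previously assigned sense overtakes the sense that was most frequent in training, which is the intended effect of giving it higher priority.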
7 Evaluation 
To evaluate the algorithm, we randomly selected 15 
files (with a total of 18,413 content words tagged in 
SemCor) from the set of 103 files of the sense tagged 
section of the Brown Corpus. Each tested file was 
removed from the set and the remaining 102 files 
were used for learning (Section 4). Every sense as- 
signed by the hierarchical disambiguation algorithm 
(Section 5) was compared with the sense from the 
corresponding semantic concordance file. Table 4 
shows the achieved accuracy compared with the ac- 
curacy which would be achieved by a simple use of 
the most frequent sense. 
As Table 4 shows, the accuracy of the word
sense disambiguation achieved by our method was
better than using the first sense for all lexical
categories. In spite of a very small training corpus, the
overall word sense accuracy exceeds 80%.
8 Related Work 
To our knowledge, there is no current method which
attempts to identify the senses of all words in whole
sentences, so we cannot make a practical comparison.
Similarly to our work, (Resnik, 1995)(Agirre and
Rigau, 1996) challenge the fine-grainedness of WordNet,
but their work is limited to nouns only. (Agirre
and Rigau, 1996) report coverage 86.2%, precision
71.2% and recall 61.4% for nouns in four randomly
selected semantic concordance files. From among
the methods based on semantic distance, (Resnik,
1993)(Sussna, 1993) use a similar semantic distance
measure for two concepts in WordNet, but they also
focus on a selected group of nouns only. (Karov and
Edelman, 1996) use an interesting iterative algorithm
and attempt to solve the sparse data bottleneck
by using a graded measure of contextual similarity.
They achieve 90.5, 92.5, 94.8 and 92.3 percent
accuracy in distinguishing between two senses
of the nouns drug, sentence, suit and player,
respectively. (Yarowsky, 1995), whose training corpus
for the noun drug was 9 times bigger than that of
Karov and Edelman, reports 91.4% correct performance,
improved to an impressive 93.9% when using the
"one sense per discourse" constraint. These methods,
however, focus on only two senses of a very
limited number of nouns and therefore are not
comparable with our approach.
9 Conclusion 
This paper presents a new general approach to word 
sense disambiguation. Unlike most of the existing 
methods, it identifies the senses of all content words 
in a sentence based on an estimation of the overall 
probability of all semantic relations in that sentence. 
By using the semantic distance measure, our method 
reduces the sparse data problem since the training 
examples and their contexts do not have to match 
the disambiguated words exactly. All the semantic 
relations in a sentence are combined according to the 
syntactic structure of the sentence, which makes the 
method particularly suitable for integration with a 
statistical parser into a powerful Natural Language 
Processing system. The method is designed to work 
with any type of common text and is capable of dis- 
tinguishing among many word senses. It has a very 
wide scope of applicability and is not limited to only 
one part-of-speech. 

References 
E. Agirre and G. Rigau. 1996. Word sense disambiguation
using conceptual density. In Proc. of
COLING, pages 16-22.
A. Berger, V. Pietra, and S. Pietra. 1996. A maximum
entropy approach to natural language processing.
Computational Linguistics, 22(1):39-72.
M. Collins. 1996. A new statistical parser based on
bigram lexical dependencies. In Proc. of the 34th
Annual Meeting of the ACL, pages 184-191.
W. Gale, K. Church, and D. Yarowsky. 1993. A
method for disambiguating word senses in a large
corpus. Computers and the Humanities, 26:415-439.
F. Jelinek, J. Lafferty, D. Magerman, R. Mercer,
A. Ratnaparkhi, and S. Roukos. 1994. Decision
tree parsing using a hidden derivation model. In
Proc. of the ARPA Human Language Technology
Workshop, pages 272-277.
Y. Karov and S. Edelman. 1996. Learning
similarity-based word sense disambiguation from
sparse data. In Proc. of the 3rd Workshop on Very
Large Corpora, pages 42-55.
D. Magerman. 1995. Statistical decision-tree models
for parsing. In Proc. of the 33rd Annual Meeting
of the ACL, pages 276-283.
M. Marcus. 1993. Building a large annotated corpus
of English: The Penn Treebank. Computational
Linguistics, 19(2):313-330.
G. Miller, C. Leacock, and R. Tengi. 1993. A semantic
concordance. In Proc. of the ARPA Human
Language Technology Workshop, pages 303-308.
G. Miller. 1990. WordNet: An on-line lexical
database. International Journal of Lexicography,
3(4):235-312.
P. Resnik. 1993. Semantic classes and syntactic
ambiguity. In Proc. of the ARPA Human Language
Technology Workshop, pages 278-283.
P. Resnik. 1995. Disambiguating noun groupings
with respect to WordNet senses. In Proc. of the
3rd Workshop on Very Large Corpora.
M. Sussna. 1993. Word sense disambiguation for
free-text indexing using a massive semantic network.
In Proc. of Second International Conference
on Information and Knowledge Management,
pages 67-74.
D. Yarowsky. 1995. Unsupervised word sense disambiguation
rivaling supervised methods. In Proc. of
the 33rd Annual Meeting of the ACL.
