Word Sense Disambiguation using a dictionary for sense similarity measure
Bruno Gaume Nabil Hathout Philippe Muller
IRIT – CNRS, UPS & INPT ERSS – CNRS & UTM IRIT – CNRS, UPS & INPT
Toulouse, France Toulouse, France Toulouse, France
gaume@irit.fr hathout@univ-tlse2.fr muller@irit.fr
Abstract
This paper presents a disambiguation
method in which word senses are deter-
mined using a dictionary. We use a seman-
tic proximity measure between words in the
dictionary, taking into account the whole
topology of the dictionary, seen as a graph
on its entries. We have tested the method on
the problem of disambiguation of the dic-
tionary entries themselves, with promising
results considering we do not use any prior
annotated data.
1 Introduction
Various tasks dealing with natural language
data have to cope with the numerous different
senses possessed by every lexical item: ma-
chine translation, information retrieval, infor-
mation extraction ... This very old issue is far
from being solved, and evaluation of methods
addressing it is far from obvious (Resnik and
Yarowsky, 2000). This problem has been tack-
led in a number of ways1: by looking at con-
texts of use (with supervised learning or un-
supervised sense clustering) or by using lexi-
cal resources such as dictionaries or thesauri.
The first kind of approach relies on data that
are hard to collect (supervised) or very sensitive
to the type of corpus (unsupervised). The sec-
ond kind of approach tries to exploit the lexical
knowledge that is represented in dictionaries or
thesaurus, with various results from its incep-
tion up to now (Lesk, 1986; Banerjee and Ped-
ersen, 2003). In all cases, a distance between
words or word senses is used as a way to find
the right sense in a given context. Dictionary-
based approaches usually rely on a comparison
of the set of words used in sense definitions and
1A good introduction is (Ide and Véronis, 1998), or
(Manning and Schütze, 1999), chap. 7.
in the context to disambiguate2.
This paper presents an algorithm which uses
a dictionary as a network of lexical items (cf.
sections 2 and 3) to compute a semantic simi-
larity measure between words and word senses.
It takes into account the whole topology of the
dictionary instead of just the entry of target
words. This arguably gives a certain robustness
of the results with respect to the dictionary. We
have begun testing this approach on word sense
disambiguation on definitions of the dictionary
itself (section 5), but the method is expected
to be more general, although this has not been
tested yet. Preliminary results are quite encour-
aging considering that the method does not re-
quire any prior annotated data, while operating
on an unconstrained vocabulary.
2 Building the graph of a
dictionary
The experiment we describe here has been
achieved on a dictionary restricted to nouns and
verbs only: we considered dictionary entries
classified as nouns and verbs and noun and verb
lemmas occurring within those entries. The
basic idea is to consider the dictionary as an
undirected graph whose nodes are noun entries,
and an edge exists between two nodes when-
ever one of them occur in the definition of the
other. More precisely, the graph of the dictio-
nary encodes two types of lexicographical in-
formations: (1) the definitions of the entries
sub-senses and (2) the structure of the entries
that is the hierarchical organisation of their sub-
senses. The graph then includes two types of
nodes: w-nodes used for the words that occur
2With the exceptions of the methods of (Kozima and
Furugori, 1993; Ide and Véronis, 1990), both based on
models of activation of lexical relations, but who present
no quantified results.
in the definitions and  -nodes used for the def-
initions of the sub-senses of the entries. The
graph is created in three phases:
1. For each dictionary entry, there is a  -
node for the entry as a whole and there is
one  -node for each of the sub-senses of
the entry. Then an edge is added between
each  -node and the  -nodes which rep-
resent the sub-senses of the next lower
level. In other words, the graph includes
a tree of  -nodes which encodes the hier-
archical structure of each entry.
2. A w-node is created in the graph for each
word occurring in a definition and an edge
is added between the w-node and the  -
node of that definition.
3. An edge is added between each w-node
and the top  -node representing the dic-
tionary entry for that word.
For instance, given the entry for "tooth"3:
1. (Anat.) One of the hard, bony appendages
which are borne on the jaws, or on other
bones in the walls of the mouth or pharynx
of most vertebrates, and which usually aid
in the prehension and mastication of food.
2. Fig.: Taste; palate.
These are not dishes for thy dainty tooth.
–Dryden.
3. Any projection corresponding to the tooth
of an animal, in shape, position, or office;
as, the teeth, or cogs, of a cogwheel; a
tooth, prong, or tine, of a fork; a tooth, or
the teeth, of a rake, a saw, a file, a card.
4. (a) A projecting member resembling a
tenon, but fitting into a mortise that
is only sunk, not pierced through.
(b) One of several steps, or offsets, in a
tusk. See Tusk.
We would consider one  -node for tooth as the
top-level entry, let us call it  0.  0 is con-
nected with an edge to the  -nodes  1,  2 , 3
and  4 corresponding to the senses 1., 2., 3.
3Source: Webster’s Revised Unabridged Dictionary,
1996. The experiment has actually been done on a
French dictionary, Le Robert.
and 4.; the latter will have an edge towards the
two  -nodes  4:1 and  4:2 for the sub-senses
4.a. and 4.b.;  4:1 will have an edge to each w-
node built for nouns and verbs occurring in its
definition (member, resemble, tenon, fit, mor-
tise, sink, pierce). Then the w-node for tenon
will have an edge to the  -node of the top-
level entry of tenon. We do not directly con-
nect  4:1to the  -nodes of the top-level entries
because these may have both w- and  -node
daughters.
In the graph,  -nodes have tags which indi-
cates their homograph number and their loca-
tion in the hierarchical structure of the entry.
These tags are sequences of integers where the
first one gives the homograph number and the
next ones indicate the rank of the sense-number
at each level. For instance, the previous nodes
 4:1 is tagged (0; 4; 1).
3 Prox, a distance between graph
nodes
We describe here our method (dubbed Prox) to
compute a distance between nodes in the kind
of graph described in the previous section. It is
a stochastic method for the study of so-called
hierarchical small-world graphs (Gaume et al.,
2002) (see also the next section). The idea is to
see a graph as a Markov chain whose states are
the graph nodes and whose transitions are its
edges, with equal probabilities. Then we send
random particles walking through this graph,
and their trajectories and the dynamics of their
trajectories reveal their structural properties. In
short, we assume the average distance a parti-
cle has made between two nodes after a given
time is an indication of the semantic distance
between these nodes. Obviously, nodes located
in highly clustered areas will tend to be sepa-
rated by smaller distance.
Formally, if G = (V; E) is an irreflexive
graph with jV j = n, we note [G] the n  n ad-
jacency matrix of G that is such that [G]i;j (the
ith row and jth column) is 1 if there is an edge
between node i and node j and 0 otherwise.
We note [ ^G] the Markovian matrix of G, such
that [ ^G]r;s = [G]r;sP
x2V ([G]r;x)
.
In the case of graphs built from a dictionary
as above,[ ^G]r;s is 0 if there is no edge between
nodes r and s, and 1=D otherwise, where D
is the number of neighbours of r. This is in-
deed a Markovian transition matrix since the
sum of each line is one (the graph considered
being connected).
We note [ ^G]i the matrix [ ^G] multiplied i
times by itself.
Let now PROX(G,i,r,s) be [ ^G]ir;s. This is thus
the probability that a random particle leaving
node r will be in node s after i time steps. This
is the measure we will use to determine if a
node s is closer to a node r than another node
t. Now we still have to find a proper value for
i. The next section explains the choice we have
made.
4 Dictionaries as hierarchical
small-worlds
Recent work in graph theory has revealed a set
of features shared by many graphs observed
"in the field" These features define the class
of "hierarchical small world" networks (hence-
forth HSW) (Watts and Strogatz, 1998; New-
man, 2003). The relevant features of a graph in
this respect are the following:
D the density of the network. HSWs typically
have a low D, i.e. they have rather few
edges compared to their number of ver-
tices.
L the average shortest path between two nodes.
It is also low in a HSW.
C the clustering rate. This is a measure of how
often neighbours of a vertex are also con-
nected in the graph. In a HSW, this feature
is typically high.
I the distribution of incidence degrees (i.e. the
number of neighbours) of vertices accord-
ing to the frequency of nodes (how many
nodes are there that have an incidence de-
gree of 1, 2, ... n). In a HSW network, this
distribution follows a power law: the prob-
ability P(k) that a given node has k neigh-
bours decreases as k  , with lambda > 0.
It means also that there are very few nodes
with a lot of neighbours, and a lot more
nodes with very few neighbours.
As a mean of comparison, table 1 shows the
differences between randoms graphs (nodes
are given, edges are drawn randomly between
nodes), regular graphs and HSWs.
The graph of a dictionary belongs to the class
of HSW. For instance, on the dictionary we
used, D=7, C=0.183, L=3.3. Table 2 gives a
few characteristics of the graph of nouns only
on the dictionary we used (starred columns in-
dicate values for the maximal self-connected
component).
We also think that the hierarchical aspect of
dictionaries (reflected in the distribution of in-
cidence degrees) is a consequence of the role
of hypernymy associated to the high polysemy
of some entries, while the high clustering rate
define local domains that are useful for dis-
ambiguation. We think these two aspects de-
termine the dynamics of random walks in the
graph as explained above, and we assume they
are what makes our method interesting for
sense disambiguation.
5 Word sense disambiguation
using Prox semantic distance
We will now present a method for disambiguat-
ing dictionary entries using the semantic dis-
tance introduced section (3).
The task can be defined as follows: we con-
sider a word lemma  occurring in the defini-
tion of one of the senses of a word  . We want
to tag  with the most likely sense it has in
this context. Each dictionary entry is coded as
a tree of senses in the graph of the dictionary,
with a number list associated to each sub-entry.
For instance for a given word sense of word W,
listed as sub-sense II.3.1 in the dictionary, we
would record that sense as a node W(2,3,1) in
the graph. In fact, to take homography into ac-
count we had to add another level to this, for
instance W(1,1,2) is sense 1.2 of the first ho-
mograph of word W. In the absence of an ho-
mograph, the first number for a word sense will
conventionally be 0.
Let G=(V,E) the graph of words built as ex-
plained section 2, [G] is the adjacency matrix
of G, and [ ^G] is the corresponding Markovian
matrix . The following algorithm has then been
applied:
1. Delete all neighbours of  in G, i.e. make
8x 2 V; [G] ;x = [G]x; = 0
2. Compute the new [ ^G]i where i is taken to
be 6
(with equal D) L C I
Random graphs small L small C Poisson Law
HSW small L high C power law
Regular graphs high L high C constant
Table 1: Comparing classes of graphs
nb nodes nb edges nb N* nb E* Diam* C* L*
Nouns 53770 392161 51511 392142 7 0.1829 3.3249
Nouns and sub-senses 140080 399969 140026 399941 11 0.0081 5.21
Table 2: Dictionary used
3. Let L be the line  in the result. 8k; L[k] =
[ ^G]i ;k
4. Let E = fx1; x2; :::; xng be the nodes cor-
responding to all the sub-senses induced
by the definition of  .
Then take xk = argmaxx2E(L[x])
Then xk is the sub-sense with the best rank ac-
cording to the Prox distance.
The following steps needs a little explana-
tion:
1 This neighbours are deleted because other-
wise there is a bias towards the sub-senses
of  , which then form a sort of "artificial"
cluster with respect to the given task. This
is done to allow the random walk to really
go into the larger network of senses.
2 Choosing a good value for the length of
the random walk through the graph is not
a simple matter. If it is too small, only lo-
cal relations appear (near synonyms, etc)
which might not appear in contexts to dis-
ambiguate (this is the main problem of
Lesk’s method); if it is too large, the dis-
tances between words will converge to a
constant value. So it has to be related in
some way to the average length between
two senses in the graph. A reasonable as-
sumption is therefore to stay close to this
average length. Hence we took i = 6 since
the average length is 5.21 (in the graph
with a node for every sub-sense, the graph
with a node for each entry having L=3.3)
6 Evaluating the results
The following methodology has been followed
to evaluate the process.
We have randomly taken about a hundred
of sub-entries in the chosen dictionary (out
of about 140,000 sub-entries), and have hand-
tagged all nouns and verbs in the entries with
their sub-senses (and homography number), ex-
cept those with only one sense, for a total of
about 350 words to disambiguate. For all pair
of (context,target), the algorithm gives a ranked
list of all the sub-senses of the target. Although
we have used both nouns and verbs to build the
graph of senses, we have tested disambiguation
first on nouns only, for a total of 237 nouns. We
have looked how the algorithm behaves when
we used both nouns and verbs in the graph of
senses.
To compare the human annotation to the au-
tomated one, we have applied the following
measures, where (h1; h2; ; :::) is the human tag,
and (s1; s2; ::) is the top-ranked system output
for a context i defined as the entry and the target
word to disambiguate:
1. if h1 = 0 then do nothing else the homo-
graph score is 1 if h1 = s1, 0 otherwise;
2. in all cases, coarse polysemy count = 1 if
h2 = s2, 0 otherwise;
3. in all cases, fine polysemy count = 1 if 8i
hi = si
Thus, the coarse polysemy score computes how
many times the algorithm gives a sub-sense that
has the same "main" sense as the human tag
(the main-sense corresponds to the first level in
the hierarchy of senses as defined above). The
fine polysemy score gives the number of times
the algorithm gives exactly the same sense as
the human.
To give an idea of the difficult of the task,
we have computed the average number of main
entry target system output human tag
correct bal#n._m.*0_3 lieu#n. 1_1_3 1_1_1
correct van#n._m.*2_0_0_0_0 voiture#n. 0_2 0_2_3
error phonétisme#n._m.*0 moyen#n. 1_1_1 2_1
error créativité#n._f.*0 pouvoir#n. 2_3 2_1
error acmé#n._m._ou_f.*0_1 phase#n. 0_1 0_4
Table 3: Detailed, main-sense evaluation of a couple of examples.
sub-senses and the number of all senses, for
each target word. This corresponds to a ran-
dom algorithm, choosing between all senses of
a given word. The expected value of this base-
line is thus:
 homograph score=Px 1/(number of ho-
mographs of x)
 coarse polysemy = Px 1/(number of main
sub-senses of x)
 fine polysemy = Px 1/(number of all sub-
senses of x)
A second baseline consists in answering al-
ways the first sense of the target word, since
this is often (but not always) the most common
usage of the word. We did not do this for homo-
graphs since the order in which they are given
in the dictionary does not seem to reflect their
importance.
Table 4 sums up the results.
7 Discussion
The result for homographs is very good but not
very significant given the low number of occur-
rences; this all the more true as we used a part-
of-speech tagger to disambiguate homographs
with different part-of-speech beforehand (these
have been left out of the computation of the
score).
The scores we get are rather good for coarse
polysemy, given the simplicity of the method.
As a means of comparison, (Patwardhan et
al., 2003) applies various measures of seman-
tic proximity (due to a number of authors), us-
ing the WordNet hierarchy, to the task of word
sense disambiguation of a few selected words
with results ranging from 0.2 to 0.4 with respect
to sense definition given in WordNet (the aver-
age of senses for each entry giving a random
score of about 0.2).
Our method already gives similar results
on the fine polysemy task (which has an
even harder random baseline) when using both
nouns and verbs as nodes, and does not focus
on selected targets.
A method not evaluated by (Patwardhan et
al., 2003) and using another semantic related-
ness measure ("conceptual density") is (Agirre
and Rigau, 1996). It is also based on a dis-
tance within the WordNet hierarchy. They used
a variable context size for the task and present
results only for the best size (thus being a
not fully unsupervised method). Their random
baseline is around 30%, and their precision is
43% for 80% attempted disambiguations.
Another study of disambiguation using a se-
mantic similarity derived from WordNet is (Ab-
ney and Light, 1999); it sees the task as a Hid-
den Markov Model, whose parameters are es-
timated from corpus data, so this is a mixed
model more than a purely dictionary-based
model. With a baseline of 0.285, they reach a
score of 0.423. Again, the method we used is
much simpler, for comparable or better results.
Besides, by using all connections simultane-
ously between words in the context to disam-
biguate and the rest of the lexicon, this method
avoids the combinatorial explosion of methods
purely based on a similarity measure, where ev-
ery potential sense of every meaningful word
in the context must be considered (unless ev-
ery word sense of words other than the target is
known beforehand, which is not a very realis-
tic assumption), so that only local optimization
can be achieved. In our case disambiguating
a lot of different words appearing in the same
context may result in poorer results than with
only a few words, but it will not take longer.
The only downside is heavy memory usage, as
with any dictionary-based method.
We have made the evaluation on dictionary
entries because they are already part of the net-
random first sense algorithm(n+v)
homographs 0.49 - 0.875 (14/16)
coarse polysemy 0.35 0.493 0.574 (136/237)
fine polysemy 0.18 0.40 0.346 (82/237)
Table 4: Results
work of senses, to avoid raising other issues too
early. Thus, we are not exactly in the context
of disambiguating free text. It could then be
argued that our task is simpler than standard
disambiguation, because dictionary definitions
might just be written in a more constrained and
precise language. That is why we give the score
when taking always the first sense for each en-
try, as an approximation of the most common
sense (since the dictionary does not have fre-
quency information). We can see that this score
is about 50% only for the coarse polysemy, and
40% for the fine polysemy, compared to a typ-
ical 70-80% in usual disambiguation test sets,
for similar sense dispersion (given by the ran-
dom baseline); in (Abney and Light, 1999), the
first-sense baseline gives 82%. So we could
in fact argue that disambiguating dictionary en-
tries seems harder. This fact remains however
to be confirmed with the actual most frequent
senses. Let us point out again that our al-
gorithm does not make use of the number of
senses in definitions.
Among the potential sources of improvement
for the future, or sources of errors in the past,
we can list at least the following:
 overlapping of some definitions for sub-
senses of an entry. Some entries of the
dictionary we used have sub-senses that
are very hard to distinguish. In order to
measure the impact of this, we should have
multiple annotations of the same data and
measure inter-annotator agreement, some-
thing that has not been done yet.
 part of speech tagging generates a few er-
rors when confusing adjectives and nouns
or adjectives and verbs having the same
lemma; this should be compensated when
we enrich the graph with entries for adjec-
tives.
 some time should be spent studying the
precise influence of the length of the ran-
dom walk considered; we have chosen a
value a priori to take into account the aver-
age length of a path in the graph, but once
we have more hand-tagged data we should
be able to have a better idea of the best
suited value for that parameter.
8 Conclusion
We have presented here an algorithm giving a
measure of lexical similarity, built from infor-
mation found in a dictionary. This has been
used to disambiguate dictionary entries, with a
method that needs no other source of informa-
tion (except part-of-speech tagging), no anno-
tated data. The coverage of the method depends
only on the lexical coverage of the dictionary
used. It seems to give promising results on dis-
ambiguating nouns, using only nouns or nouns
and verbs. We intend to try the method after
enriching the network of senses with adjectives
and/or adverbs. We also intend, of course, to try
the method on disambiguating verbs and adjec-
tives.
Moreover, the method can be rather straight-
forwardly extended to any type of disambigua-
tion by considering a context with a target
word as a node added in the graph of senses
(a kind of virtual definition). We have not
tested this idea yet. Since our method gives a
ranked list of sense candidates, we also con-
sider using finer performance measures, taking
into account confidence degrees, as proposed in
(Resnik and Yarowsky, 2000).

References

Steven Abney and Marc Light. 1999. Hiding
a semantic hierarchy in a markov model. In
ACL’99 Workshop Unsupervised Learning in
Natural Language Processing, University of
Maryland.

E. Agirre and G. Rigau. 1996. Word sense
disambiguation using conceptual density. In
Proceedings of COLING’96, pages 16–22,
Copenhagen (Denmark).

S. Banerjee and T. Pedersen. 2003. Extended
gloss overlaps as a measure of semantic re-
latedness. In Proceedings of the Eighteenth
International Conference on Artificial Intel-
ligence (IJCAI-03), Acapulco, Mexico.

B. Gaume, K. Duvignau, O. Gasquet, and M.-
D. Gineste. 2002. Forms of meaning, mean-
ing of forms. Journal of Experimental and
Theoretical Artificial Intelligence, 14(1):61–74.

N. Ide and J. Véronis. 1998. Introduction to
the special issue on word sense disambigua-
tion: The state of the art. Computational
Linguistics, 24(1).

N. Ide and J. Véronis. 1990. Word sense dis-
ambiguation with very large neural networks
extracted from machine readable dictionar-
ies. In Proceedings of the 14th International
Conference on Computational Linguistics
(COLING’90), volume 2, pages 389–394.

H. Kozima and T. Furugori. 1993. Similarity
between words computed by spreading acti-
vation on an english dictionary. In Proceed-
ings of the conference of the European chap-
ter of the ACL, pages 232–239.

M. Lesk. 1986. Automatic sense disambigua-
tion using machine readable dictionaries:
how to tell a pine code from an ice cream
cone. In Proceedings of the 5th annual in-
ternational conference on Systems documen-
tation, pages 24–26, Toronto, Canada.

C. Manning and H. Schütze. 1999. Founda-
tions of Statistical Natural Language Pro-
cessing. MIT Press.

M. E. J. Newman. 2003. The structure and
function of complex networks. SIAM Re-
view, 45:167–256.

S. Patwardhan, S. Banerjee, and T. Pedersen.
2003. Using measures of semantic related-
ness for word sense disambiguation. In Pro-
ceedings of the Fourth International Confer-
ence on Intelligent Text Processing and Com-
putational Linguistics (CICLING-03).

P. Resnik and D. Yarowsky. 2000. Distinguish-
ing systems and distinguishing senses: New
evaluation methods for word sense disam-
biguation. Natural Language Engineering,
5(2):113–133.

D.J. Watts and S.H Strogatz. 1998. Collective
dynamics of ’small-world’ networks. Na-
ture, (393):440–442.
