Lexical Disambiguation using Simulated Annealing 
Jim Cowie, Joe Guthrie* and Louise Guthrie 
Computing Research Laboratory 
Box 30001 
New Mexico State University 
Las Cruces, NM 88003-0(301 
ABSTRACT 
The resolution of lexical ambiguity is 
important for most natural language process- 
ing tasks, and a range of computational 
techniques have been proposed for its solu- 
tion. None of these has yet proven effective 
on a large scale. In this paper, we describe 
a method for lexical disambiguation of text 
using the definitions in a machine-readable 
dictionm~j together with the technique of 
simulated annealing. The method operates 
on complete sentences and attempts to select 
the optimal combinations of word senses for 
all the words in the sentence simultaneously. 
The words in the sentences may be any of 
the 28,000 headwords in Longman's Dic- 
tionary of Contemporary English (LDOCE) 
and are disambiguated relative to the senses 
given in LDOCE. Our initial results on a 
sample set of 50 sentences are comparable 
to those of other researchers, and the fully 
automatic method requires no hand-coding 
of lexical entries, or hand-tagging of text. 
L Introduction 
The problem of word-sense disambi- 
guation is central to text processing. 
Recently, promising computational methods 
have been suggested \[Lesk, 1986; McDonald 
* Present address: Mathematics DepaJtment. 
University of Texas at El Paso, El Paso, Tx 
79968 
et al., 1990; Veronis and Ide, 1990; Wilks et 
al., 1990; Zernik and ,lacobs, 1990; Guthrie 
et al., 1991; Hearst, 1991\] which attempt to 
use the local context of the word to be 
disambiguated together with information 
about each of its word senses to solve this 
problem. Lesk \[1986\] described a technique 
which measured the amount of overlap 
between a dictionary sense definition and 
the definitions of the words in the local con- 
text of the word to be disambiguated. He 
illustrated his method by successfully 
disambiguating the word "cone" in the 
phrases "pine cone" and "ice cream cone". 
Later researchers have extended this basic 
idea in various ways. Wilks et al., \[1990\] 
identified neighborhoods of the 2,187 con- 
trol vocabulary words in Longman's Dic- 
tionary of Contemporaay English (LDOCE) 
\[Procter, 1978\] based on the co-occurrence 
of words in LDOCE dictionary definitions. 
These neighborhoods were then used to 
expand the word sense definitions of the 
word to be disambiguated, and the overlap 
between the expanded definitions and the 
local context was used to select the correct 
sense of a word. A similar method reported 
by Guthrie et al., \[1991\] who defined subject 
specific neighborhoods of words, using the 
subject area markings in the machine read- 
able version of LDOCE. Hearst \[1991\] sug- 
gests using syntactic information and part- 
of-speech tagging to aid in the disambigua- 
tton. She gathers co-occurrence information 
Aclxs DE COLING-92. NANTES, 23-28 AOt\]T 1992 3 5 9 PROC, OV COLING-92, NAN'rES, AUG. 23-28, 1992 
from manually senseutagged text. Zemik 
and Jacobs \[1990\] also derive their neigh- 
borhoods from a training text which has 
I~en sense-tagged by hand. Their method 
incorporates other clues as to the sense of 
the word in question found in the morphol- 
ogy or by first tagging the text as to part of 
speech. 
Although each of these techniques look 
somewhat promising for disambiguation, the 
techniques have only been applied to several 
words, and the results have been based on 
experiments which repeatedly disambiguate 
a single word (or in \[Zernik and Jacobs. 
1990\], one of three words) in a large 
number of sentences. In the cases where a 
success rate for the technique is reported, 
the results vary from 35% to 80%, depend- 
ing on whether the correct sense is desired, 
or some coarser grained distraction is con- 
sidered acceptable. 
Since only one sense is computed at a 
time, the question arises as to whether and 
how to incolporate the fact that a sense has 
been chosen for a word when attempting to 
disambiguate the next. Should this first 
choice be changed in light of how other 
word senses are selected? Although these 
problems were pointed out in Lesk's origi- 
nal paper, they have not yet been addressed. 
A method of word sense disambigua- 
tion which is designed to operate on a large 
scale and simultaneously for several words 
was suggested by Veronis and Ide \[1990\]. 
The basis of this method is the construction 
of large neural networks which have words 
and word senses (chosen from the machine 
readable version of the Collins Dictionary) 
as nodes. Links are established from a word 
to each of its senses, and from each sense to 
every word in its definition. Inhibiting links 
are constructed between senses of the same 
word. In order to disambiguate a sentence, 
the words in the sentence are activated in 
the network, and this activation is allowed 
to spread with feedback. This cycle is 
repeated a presetected number of times, e.g., 
100. At the end of this process, each word 
in the sentence is disambiguated by choos- 
ing its sense which is most highly activated. 
The authors report encouraging results 
on word pairs such as "page" and "pen" and 
"pen" and "goat". The only complete sen- 
tence reported on was "The young page put 
the goat in the pen" in which "page" and 
"pen" might be expected to work together to 
cause the wrong sense of each to be chosen. 
The inclusion of the word "young" over- 
comes this problem, and both "page" and 
"pen" are correctly disambiguated. 
The authors report that problems are 
presented by such factol~ as maintaining a 
balance between the activation of a word 
and its senses and the fact that a word with 
many senses tends to have more connections 
than one with fewer senses. They indicate 
that matters such as setting thresholds and 
rates of decay also present some difficulties. 
In contrast to the somewhat numerical 
techniques described above, more principled 
methods based on linguistic information 
such as selection restrictions or semantic 
preferences \[Wilks, 1975a; 1975b; Wilks 
and Fass, 1991\] have also been used for lex- 
ical disambiguation. These methods require 
extensive hand crafting by specialists of lex- 
ical items: assigning semantic categories to 
nouns, preferences to verbs and adjectives, 
etc.. Maintaining consistency in these 
categories and preferences is a problem, and 
these methods are also susceptible to the 
combinatorial explosion described above. 
In this paper we suggest the application 
of a computational method called simulated 
annealing to this general class of methods 
(including some of the numerical methods 
referenced above) to allow all senses to be 
determined at once in a computationally 
effective way. We describe the application 
of simulated annealing to a basic method 
similar to that of Lesk \[1986\] whieh also 
uses the subject area markings in LDOCE, 
but which doesn't make use of other 
features such as part of speech tagging. The 
simplicity of the technique makes it fully 
automatic, and it requires no hand-tagging 
AcrEs DE COLING-92, NANTES, 23-28 AOUT 1992 3 6 0 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
of text or hand-crafting of neighborhoods. 
When this basic method operates under the 
guidance of the simulated annealing algo- 
rithm, sense selections are made con- 
currently for all ambiguous words in the 
sentence in a way designed to optimize their 
choice. The system's performance on a set 
of test sentences was encouraging and can 
be expected to improve when some of the 
refinements mentioned above are incor- 
porated. 
2. Simulated Annealing 
The method of simulated annealing 
\[Metropolis et al., 1953; Kirkpatrick et al., 
1983\] is a technique for solving large scale 
problems of combinatorial minimization. It 
has been successfully applied to the famous 
traveling salesman problem of finding the 
shortest route for a salesman who must visit 
a number of cities in turn, and is now a 
standard method for optimizing the place- 
ment of circuit elements on large scale 
integrated circuits. Simulated annealing was 
applied to parsing by Sampson \[1986\], but 
since the method has not yet been widely 
applied to Computational Linguistics or 
Natural Language Processing, we describe it 
briefly. 
The name of the algorithm is an anal- 
ogy to the process by which metals cool and 
anneal. A feature of this phenomenon is 
that slow cooling usually allows the metal to 
reach a uniform composition and a 
minimum energy state, while fast cooling 
leads to an amorphous state with higher 
energy. In simulated annealing, a parameter 
T which corresponds to temperature is 
decreased slowly enough to allow the sys- 
tem to find its minimum. 
The process requires a function E of 
configurations of the system which 
corresponds to the energy. It is E that we 
seek to minimize. From a stinting point, a 
new configuration is randomly chosen, and a 
new value of E is computed. If the new E is 
less than the old one, the new configuration 
is chosen to replace the older. An essential 
feature of simulated annealing is that even if 
the new E is larger than the old (indicating 
that this configuration is farther away from 
the desired minimum than tile last choice), 
the new configuration may be chosen. The 
decision of whether or not to replace the old 
configuration with the new infelior one is 
made probabilistically. This feature of 
allowing the algorithm to "go up hill" helps 
it to avoid setting on a local minimum 
which is not the actual minimum. In 
succeeding trials, it becomes more difficult 
for configurations which increase E to be 
chosen, and finally, when the method has 
retained the same configuration for long 
enough, that configuration is chosen as the 
solution. In the travelnig salesman example, 
the configurations are the different paths 
through the cities, and E is the total length 
of his trip. The final configmation is an 
approximation to the shortest path through 
the cities. The next section describes how 
the algorithm may be applied to word-sense 
disambiguation. 
3. Word-Sense Disambiguation 
Given a sentence with N words, we 
may represent the senses of the ith word as 
sil, si2, ' sik,, where k~ is the number of 
senses of the ith word which appear in 
LDOCE. A configuration of the system is 
obtained by choosing a sense for each word 
in the sentence. Our goal is to choose that 
configuration which a human disambiguator 
would choose. To that end, we must define 
a function E whose minimum we may rea- 
sonable expect to correspond to the correct 
choice of the word senses. 
The value of E for a given 
configuration is calculated in terms of the 
definitions of the N senses which make it 
up. All words in these definitions are 
stemmed, and the results stored in a list. If 
a subject code is given for a sense, the code 
is treated as a stemmed word. The redun- 
dancy R is computed by giving a stemmed 
word form which appears n times a score of 
n-1 and adding up the scores. Finally, E is 
defined to be - 1 I+R ' 
ACRES DI,; COLING-92, NANTES, 23-28 Ate' 1992 3 6 1 PROC. OF COLING-92, NANTEs, Ant;. 23-28, 1992 
The rationale behind this choice of E is 
that word senses which belong together in a 
sentence will have more words and subject 
codes in common in their definitions (larger 
values of R) than senses which do not 
belong together. Minimizing E will maxim- 
ize R and determine our choice of word 
senses, 
The starting configuration C is chosen 
to be that in which sense number one of 
each word is chosen. Since the senses in 
LDOCE are generally listed with the most 
frequently used sense first, this is a likely 
starting point. The value of E is computed 
for this configuration. The next step is to 
choose at random a word number i and a 
sense S~j of that ith word. The configuration 
C' is is constnacted by replacing the old 
sense of the ith word by the sense S o. Let 
L~E be the change fTom E to the value com- 
puted for C'. If ~E < 0, then C' replaces C, 
and we make a new random change in C'. 
If A~. > 0, we change to C' with probability 
~E 
P = e r. In this expression, T is a constant 
whose initial value is 1, and thedecision of 
whether or not to adopt C' is made by cal- 
ling a random number generator. If the 
number generated is less than P, C is 
replaced by C'. Otherwise, C is retained. 
This process of generating new 
configurations and checking to see whether 
or not to choose them is repeated on the 
order of 1000 times, T is replaced by 0.9T, 
and the loop entered again. Once the loop is 
executed with no change in the 
configuration, the routine ends, and this final 
configuration tells which word seflses are to 
be selected. 
4. Experiments 
To evaluate a method of word sense 
dtsambiguation it is necessary to check the 
results by hand or have text which has 
already been disambiguated by hand to use 
as test data. Since there is no general agree- 
ment on word senses, each system must 
have its own test data. Thus even though 
the algorithm we have described is 
automatic and has coverage of the 28, 000 
words in LDOCE, the evaluation is the tedi- 
ous hand work the system is meant to ease 
or eliminate. 
In our first experiment, the algorithm 
described above was used to disambiguate 
50 example sentences from LDOCE. A stop 
list of very common words such as "the", 
"as", and "of" was removed from each sen- 
tence. The sentences then contained from 
two to fifteen words, with an average of 5.5 
ambiguous words per sentence. Definitions 
in LDOCE are broken down first into broad 
senses which we call "homographs", and 
then into individual senses which distinguish 
among the various meanings. For example, 
one homograph of "bank" means roughly 
"something piled up." There are five senses 
in this homograph which distinguish 
whether the thing piled up is snow, clouds, 
earth by a river, etc. 
Results of the algorithm were evaluated 
by having a Iterate human disambiguate the 
sentences and comparing these choices of 
word senses with the output of the program. 
Using the human choices as the standard, 
the algorithm correctly disambiguated 47% 
of the words to the sense level, and 72 % to 
the homograph level. 
More recently we have developed a 
software tool to improve the process of 
manual disambiguation of test sentences. 
Slight modifications to the software allow it 
to be used in conjunction with the algorithm 
as a computer aided disambiguation system. 
The software displays the text to be disam- 
biguated in a window, and when the user 
chooses a word, all its definitions are 
displayed in another window. The user 
then selects the appropriate sense, and this 
selection is added to a file corresponding to 
the original text. This file is called the key 
and the results of the algorithm are scored 
against it. 
Using this tool, 17 sentences for the 
Wall Street Journal were disambiguated by 
hand relative to LDOCE. The same stop list 
AcI'~ DE COLING-92, NANTES, 23-28 Ao(rr 1992 3 6 2 Pgoc. OF COLING-92, NANTES, AUG. 23-28, 1992 
of common words was used as in the first 
experiment. The algorithm was used to 
disambiguate the 17 sentences, and the 
results automatically scored against the key. 
Results for the Wall Street Journal sentences 
were similar to those for the first experi- 
ment. 
One difficulty with the present algo- 
rithm is that long definitions tend to be 
given preference over shorter ones. Words 
defined succinctly by a synonym are greatly 
penalized. The function E must be made to 
better model the problem to improve perfor- 
mance. On the other hand, the simulated 
annealing itself seems to be doing very well 
at finding the minimum. In those cases 
where the configuration selected is not the 
correct disambiguation of the sentence, the 
correct disambiguation never had a lower 
value of E than the configuration selected. 
Experiments in which we varied the begin- 
ning temperature and the rate of cooling 
didn't change tile configuration ultimately 
selected and seemed to show that those 
parameters are not very delicate. 
Direct comparisons of these success 
rates with those of other methods is difficult. 
Veronis and Ide \[1990\] propose a large scale 
method, but results are reported for only one 
sentence, and no success rate is given. 
None of the other methods was used to 
disambiguate every ambiguous word in a 
sentence. They were applied to one, or at 
most a few, highly ambiguous words. It 
appears that in some cases the fact that our 
success rates include not only highly ambi- 
guous words, but some words w~th only a 
few senses is offset by the fact that other 
researchers have used a broader definition of 
word sense. For example, the four senses of 
"interest" used by Zernlk and Jacobs \[1990\] 
may correspond more closely to our two 
homographs and not our ten senses of 
"interest." Their success rate in tagging the 
three words "interest", "stock", and "bond" 
was 70%. Thus it appears that the method 
we propose is comparable in effectiveness to 
the other computational methods of word- 
sense disambiguation, and has the advan- 
tages of being automatically applicable to all 
the 28,000 words in LDOCE and of being 
computationally pructical. 
Below we give two examples of the 
results of the technique. The words follow- 
ing the arrow are the stemmed words 
selected from the definitions and used to cal- 
culate the redundancy. The headword and 
sense numbers are those used in the 
machine readable version of LDOCE. 
EXAMPLE SENTENCE 1 
The fish floundered on the river bank, 
struggling to breathe 
DISAMBIGUATION 
1) fish hw 1 sense ! : DEF -> fish 
creature whose blood change tempera- 
ture according around live water use its 
FIN tail swim 
2) river hw 0 sense 1 : DEF -> river wide 
nature stream water flow between bank 
lake another sea 
3) bank hw 1 sense 1 : DEF -> bank land 
along side river lake 
4) sta-uggle hw 1 sense 0 : DEF -> s~ug- 
gle violent move fight against thing 
5) breathe hw 0 sense 2 : DEF -> breathe 
light live 
EXAMPLE SENTENCE 2 
The interest on my 
accrued over the years 
DISAMBIGUATION 
bank account 
1) interest hw 1 sense 6 : DEF -> interest 
money paid use 
2) bank hw 4 sense 1 : DEF -> bank place 
money keep paid demand where related 
activity go 
3) account hw 1 sense 5 : DEF -> account 
record state money receive paid bank 
busy particular period date 
ACRES DE COLING-92, NAI'CI'ES, 23-28 Ao~zr 1992 3 6 3 Prtoc. oi: COI,1NG-92, NANTES, AUG. 23-28, 1992 
4) accrue hw 0 sense 0 : DEF -> accrue 
become big more addition 
5) year hw 0 sense 3 : DEF -> year period 
365 day measure any point 
Finally, we show two graphs which 
illustrate the convergence of the simulated 
annealing technique to the minimum energy 
• ' (E) level. The second graph is a close-up of 
the final cycles of the complete process 
shown in the first graph. 
I~QI-- 
tm~-, 
Imm- 
tilal-- 
lira-- 
dam-, 
~umamlinll 
..... 44 
f .... o .., 
J L ....... ) 
5--- ._-St--" 
--- ,; ,--- I-.. ~-.::: 
*~113"" 
, ,.~lu-tl " 
Y,.'I3- 
• 
~111"" 
• ,)" " 'I"! ~ 
/ I 
~uf ~x 14 "1 
Ama~llRg 
'y, 
= 
5. Conclusion 
This paper describes a method for 
word-sense disambiguation based on the 
ample technique of choosing senses of the 
words in a sentence so that their definitions 
in LDOCE have the most words and subject 
codes in common. The amount of computa- 
tion necessary to find this optimal choice 
exactly quickly becomes prohibitive as the 
number of ambiguous words and the number 
of senses increase. The computational tech- 
nique of simulated annealing allows a good 
approxamation to be computed quickly. 
Thus all the words m a sentence are disam- 
biguated simultaneously, m a reasonable 
rune, and automatically (with no hand 
disambiguation of training text). Results 
using this technique are comparable to other 
computational techniques and enhancements 
incorporating co-occurrence and part-of- 
speech information, which have been 
exploited in one-word-at-a time techniques, 
may be expected to improve the perfor- 
mance. 
ACRES DE COLING-92, NAtCrEs, 23-28 AO~l' 1992 3 6 5 PROC. OF COLING-92. N^NTES. AUG. 23-28, 1992 
ACTES DE COLING-92, NANTES, 23-28 AOtrr 1992 3 6 4 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 

References 

Guthrie, I., Guthrie, L., Wilks, Y., and Aidi- 
nejad, H. (1991). Subject-Dependent 
Co-Occurrence and Word Sense 
Disambiguafion, Proceedings of the 
29th Annual Meeting of the Association 
for Computational Linguistics, Berke- 
ley, CA. June 1991. pp. 146-152. 
Also Memoranda in Computer and 
Cognitive Science MCCS-91-206, 
Computing Research Laboratory, New 
Mexico Smm University. 

Hearst, M. (1991). Toward Noun Homonym 
Disambiguation - Using Local Context 
in Large Text Corpora Proceedings of 
the Seventh Annual Conference of the 
UW Centre for the New OED and Text 
Research, Using Corpora pp. 1-22. 

Kirkpatrick, S., Gelatt, C. D., and Vecchi, 
M. P. (1983). Optimization by Simu- 
tared Annealing, Science vol. 220, pp. 
671-680. 

McDonald, J. E., Plate, T, and 
Schvaneveldt, R. W. (1990). Using 
Pathfinder to Extract Semantic Infor- 
mation from Text. In Schvaneveldt, R. 
W. (ed.) Pathfinder Associative Net- 
works: Studies in Knowledge Organi- 
sation, Norwood, NJ: Ablex. 

• Metaopolis, N., Rosenbluth, A., Rosenbluth, 
M., Teller, A., and Teller, E. (1953) J. 
Chem. Phys. vol. 21, p.1087. 

Procter, P., R. llson, J. Ayto, et al. (1978) 
Longman Dictionary of Contemporary 
English. Harlow, UK: Longman Group 
Limited. 

Sampson, G. (1986). A Stochastic Approach 
to Parsing. llth International Confer- 
ence on Computational Linguistics 
(COL1NG-86). pp. 151-155. 

Veronis, J. and N. Ide.(1990).Word Sense 
Disambiguation with Very Large 
Neural Networks Extracted from 
Machine Readable Dictionaries. 
Proceedings of the 13th Conference on 
Computational Linguistics (COLING- 
90), Helsinki, Finland, 2, pp. 389-394. 

Wilks, Yorick A. (1975a). An Intelligent 
Analyzer and Understander of English. 
Communications of the ACM, 18, 5, pp. 
264-274. Reprinted in "Readings in 
Natural Language Processing," Edited 
by Barbara J. Grosz, Karen Sparck- 
Jones and Bonnie Lynn Webber, Los 
Altos: Morgan Kaufmann, 1986, pp. 
193-203. 

Wilks, Yorick A. (1975b). A Preferential 
Pattern-Seeking Semantics for Natural 
Language Inference. Artificial Intelli- 
gence, 6, pp. 53-74. 

Wilks, Y. and Fass. D. (1991). Preference 
Semantics: a family history, To appear 
m Computing and Mathematics with 
Applications (in press). A shorter ver- 
sion in the second edition of the Ency- 
clopedia of Artificial Intelligence. 

Zernik, Lift and Paul Jacobs (1990). Tag- 
ging for Learning: Collecting Thematic 
Relations from Corpus. Proceedings of 
the 13th International Conference on 
Computational Linguistics (COLING° 
90), Helsinki, Finland, 1, pp.34-37. 
