Bridging the Gap between Dictionary and Thesaurus 
Oi Yee Kwong 
Computer Laboratory, University of Cambridge 
New Museums Site, Cambridge CB2 3QG, U.K. 
oyk20@cl.cam.ac.uk 
Abstract 
This paper presents an algorithm to integrate dif- 
ferent lexical resources, through which we hope to 
overcome the individual inadequacy of the resources, 
and thus obtain some enriched lexical semantic in- 
formation for applications such as word sense disam- 
biguation. We used WordNet as a mediator between 
a conventional dictionary and a thesaurus. Preliminary
results support our hypothesised structural relationship
among the resources, which enables their integration.
These results also suggest that we can combine the
resources to achieve an overall balanced degree of
sense discrimination.
1 Introduction 
It is generally accepted that applications such as
word sense disambiguation (WSD), machine translation
(MT) and information retrieval (IR) require a wide
range of resources to supply the necessary lexical
semantic information. For instance, Calzolari (1988)
proposed an Italian lexical database which has the
features of both a dictionary and a thesaurus, and
Klavans and Tzoukermann (1995) tried to build a fuller
bilingual lexicon by enhancing machine-readable
dictionaries with large corpora.
Among the attempts to enrich lexical information, 
many have been directed to the analysis of dictio- 
nary definitions and the transformation of the im- 
plicit information to explicit knowledge bases for 
computational purposes (Amsler, 1981; Calzolari, 
1984; Chodorow et al., 1985; Markowitz et al., 
1986; Klavans et al., 1990; Vossen and Copestake, 
1993). Nonetheless, dictionaries are also notorious
for their non-standardised sense granularity, and the
taxonomies obtained from definitions are inevitably
ad hoc. It would therefore be a good idea to unify
our lexical semantic knowledge under some existing,
and widely exploited, classification such as the
system in Roget's Thesaurus (Roget, 1852), which
has remained intact for years and has been used in
WSD (Yarowsky, 1992).
While the objective is to integrate different lex- 
ical resources, the problem is: how do we recon- 
cile the rich but variable information in dictionary 
senses with the cruder but more stable taxonomies 
like those in thesauri? 
This work is intended to fill this gap. We use 
WordNet as a mediator in the process. In the fol- 
lowing, we will outline an algorithm to map word 
senses in a dictionary to semantic classes in some 
established classification scheme. 
2 Inter-relatedness of the Resources 
The three lexical resources used in this work are the 
1987 revision of Roget's Thesaurus (ROGET) (Kirk- 
patrick, 1987), the Longman Dictionary of Contem- 
porary English (LDOCE) (Procter, 1978) and Word- 
Net 1.5 (WN) (Miller et al., 1993). Figure 1 shows 
how word senses are organised in them. As we have 
mentioned, instead of directly mapping an LDOCE 
definition to a ROGET class, we bridge the gap with 
WN, as indicated by the arrows in the figure. Such 
a route is made feasible by linking the structures in 
common among the resources. 
Words are organised in alphabetical order in 
LDOCE, as in other conventional dictionaries. The 
senses are listed after each entry, in the form of text 
definitions. WN groups words into sets of synonyms 
("synsets"), with an optional textual gloss. These 
synsets form the nodes of a taxonomic hierarchy. 
In ROGET, each semantic class comes with a number,
under which words are first sorted by part of
speech and then grouped into paragraphs according
to the idea conveyed.
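
The organisation just described can be pictured with a few record types. The following is only a minimal sketch: the class and field names are hypothetical, the single-parent link is a simplification of the WN taxonomy, and the toy data reuse the placeholder names of Figure 1 rather than real entries.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class LdoceEntry:
    """A dictionary entry: headword plus one text definition per sense."""
    headword: str
    pos: str
    definitions: List[str]


@dataclass(eq=False)
class Synset:
    """A WN node: a set of synonyms, an optional gloss, a link to its parent."""
    words: FrozenSet[str]
    gloss: str = ""
    hypernym: Optional["Synset"] = None  # assumption: a single parent


@dataclass
class RogetClass:
    """A numbered class: part of speech -> list of paragraphs (word sets)."""
    number: int
    paragraphs: Dict[str, List[FrozenSet[str]]] = field(default_factory=dict)


def coordinates(synset: Synset, all_synsets: List[Synset]) -> List[Synset]:
    """Return the other synsets that share this synset's hypernym."""
    return [s for s in all_synsets
            if s is not synset
            and synset.hypernym is not None
            and s.hypernym is synset.hypernym]


# Toy data with the placeholder names of Figure 1 (not real entries).
c = Synset(frozenset({"c1", "c2"}))
x = Synset(frozenset({"x1", "x2"}), gloss="gloss of X", hypernym=c)
m = Synset(frozenset({"m1", "m2"}), hypernym=c)
roget_120 = RogetClass(120, {"N": [c.words | m.words | x.words]})
entry = LdoceEntry("x2", "n", ["definition of sense 1", "definition of sense 2"])
```

In this picture the coordinate synsets of X are found by looking for siblings under the same parent, which is the only taxonomic operation the later steps rely on.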
Let us refer to Figure 1 and start from word $x_2$ in
WN synset $X$. Since words expressing every aspect
of an idea are grouped together in ROGET, we can
therefore expect to find not only words in synset $X$,
but also those in the coordinate WN synsets (i.e. $M$
and $P$, with words $m_1$, $m_2$, $p_1$, $p_2$, etc.) and
the superordinate WN synsets (i.e. $C$ and $A$, with
words $c_1$, $c_2$, etc.) in the same ROGET paragraph. In
other words, the thesaurus class to which $x_2$ belongs
should include roughly $X \cup M \cup P \cup C \cup A$.
Meanwhile, the LDOCE definition corresponding to the
sense of synset $X$ (denoted by $D_x$) is expected to be
similar to the textual gloss of synset $X$ (denoted by
$\mathrm{Gl}(X)$).
In addition, given that it is not unusual for
dictionary definitions to be phrased with synonyms
or superordinate terms, we would also expect to find
words from $X$ and $C$, or even $A$, in the LDOCE
definition. That means we believe
$D_x \approx \mathrm{Gl}(X)$ and
$D_x \cap (X \cup C \cup A) \neq \emptyset$.

[Figure 1 has three panels. (ROGET) Class 120, N.: c1, c2, ...
(in C); m1, m2, ... (in M); p1, p2, ... (in P); x1, x2, ... (in X);
followed by V. and Adj. sections and by class 121. (WN) A taxonomy
with A at the top; B and C {c1, c2, ...}, Gl(C) below it; then E, F,
M {m1, m2, ...}, Gl(M), P {p1, p2, ...}, Gl(P) and
X {x1, x2, ...}, Gl(X); then R and T. (LDOCE) Entry x2 with senses
1 to 3, where sense 1 has a definition (Dx) similar to Gl(X) or
defined in terms of words in X or C, etc.; entry x3 with senses
1 and 2.]
Figure 1: Organisation of word senses in different resources
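
The two expectations of this section can be checked with plain set operations. The snippet below is an illustrative sketch only: the synset members are the placeholder names of Figure 1, the word "a1" and the definition words are invented, and the `coverage` ratio is merely one hypothetical way to read "roughly".

```python
# Placeholder synsets named as in Figure 1. A has no listed words there,
# so the member "a1" is invented for illustration.
X = {"x1", "x2"}
M = {"m1", "m2"}
P = {"p1", "p2"}
C = {"c1", "c2"}
A = {"a1"}

# The ROGET paragraph as drawn in Figure 1, and a hypothetical D_x.
roget_paragraph = {"c1", "c2", "m1", "m2", "p1", "p2", "x1", "x2"}
definition_x = {"a", "c1", "that", "is", "like", "x1"}

# Expectation 1: the paragraph covers roughly X u M u P u C u A.
expected = X | M | P | C | A
coverage = len(expected & roget_paragraph) / len(expected)

# Expectation 2: the definition shares words with X, C or A.
shared = definition_x & (X | C | A)

print(f"coverage of the expected set: {coverage:.2f}")
print(f"definition words found in X, C or A: {sorted(shared)}")
```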
3 The Algorithm 
The possibility of using statistical methods to assign 
ROGET category labels to dictionary definitions has 
been suggested by Yarowsky (1992). Our algorithm 
offers a systematic way of linking existing resources 
by defining a mapping chain from LDOCE to RO- 
GET through WN. It is based on shallow process- 
ing within the resources themselves, exploiting their 
inter-relatedness, and does not rely on extensive
statistical data. It therefore has the advantage of
being immune to changes in sense discrimination over
time, since it depends only on the organisation of
the resources and not on their individual entries.
Given a word with part of speech, $W(p)$, the core
steps are as follows:
Step 1: From LDOCE, get the sense definitions
$D_1, \ldots, D_t$ under the entry $W(p)$.
Step 2: From WN, find all the synsets
$S_n\{w_1, w_2, \ldots\}$ such that $W(p) \in S_n$. Also
collect the corresponding gloss definitions,
$\mathrm{Gl}(S_n)$, if any, the hypernym synsets
$\mathrm{Hyp}(S_n)$, and the coordinate synsets
$\mathrm{Co}(S_n)$.
Step 3: Compute a similarity score matrix $A$ for
the LDOCE senses and the WN synsets. A
similarity score $A(i,j)$ is computed for the $i$th
LDOCE sense and the $j$th WN synset using
a weighted sum of the overlaps between the
LDOCE sense and the WN synset, hypernyms,
and gloss respectively, that is
\[
A(i,j) = a_1 |D_i \cap S_j| + a_2 |D_i \cap \mathrm{Hyp}(S_j)|
       + a_3 |D_i \cap \mathrm{Gl}(S_j)|
\]
For our tests, we tried setting $a_1 = 3$, $a_2 = 5$
and $a_3 = 2$ to reveal the relative significance of
finding a synonym, a hypernym, and any word
in the textual gloss respectively in the
dictionary definition.
Step 4: From ROGET, find all paragraphs
$P_m\{w_1, w_2, \ldots\}$ such that $W(p) \in P_m$.
Step 5: Compute a similarity score matrix $B$ for the
WN synsets and the ROGET classes. A similarity
score $B(j,k)$ is computed for the $j$th WN
synset (taking the synset itself, the hypernyms,
and the coordinate terms) and the $k$th ROGET
class, according to the following:
\[
B(j,k) = b_1 |S_j \cap P_k| + b_2 |\mathrm{Hyp}(S_j) \cap P_k|
       + b_3 |\mathrm{Co}(S_j) \cap P_k|
\]
We have set $b_1 = b_2 = b_3 = 1$. Since a ROGET
class contains words expressing every aspect of
the same idea, it should be equally likely to find
synonyms, hypernyms and coordinate terms in
common.
Step 6: For $i = 1$ to $t$ (i.e. each LDOCE sense), find
$\max_j A(i,j)$ from matrix $A$. Then trace from
matrix $B$ the $j$th row and find $\max_k B(j,k)$.
The $i$th LDOCE sense should finally be mapped
to the ROGET class to which $P_k$ belongs.
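
The scoring and tracing in Steps 3, 5 and 6 can be sketched as follows. This is a toy illustration under several assumptions, not the implementation used in the tests: definitions, glosses and paragraphs are taken to be pre-tokenised sets of words, ties go to the first maximum, an all-zero row is treated as "no match", and the example words are invented rather than taken from LDOCE, WN or ROGET. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Keys assumed for each synset: "words", "hypernyms", "coordinates", "gloss".
Synset = Dict[str, Set[str]]


def score_a(definition: Set[str], synset: Synset,
            a1: int = 3, a2: int = 5, a3: int = 2) -> int:
    """Step 3: weighted overlap of a definition with synset, hypernyms, gloss."""
    return (a1 * len(definition & synset["words"])
            + a2 * len(definition & synset["hypernyms"])
            + a3 * len(definition & synset["gloss"]))


def score_b(synset: Synset, paragraph: Set[str],
            b1: int = 1, b2: int = 1, b3: int = 1) -> int:
    """Step 5: overlap of synset, hypernyms and coordinates with a paragraph."""
    return (b1 * len(synset["words"] & paragraph)
            + b2 * len(synset["hypernyms"] & paragraph)
            + b3 * len(synset["coordinates"] & paragraph))


def best(scores: List[int]) -> Optional[int]:
    """Index of the first maximum, or None if all scores are zero (assumption)."""
    if not scores or max(scores) == 0:
        return None
    return scores.index(max(scores))


def map_senses(definitions: List[Set[str]], synsets: List[Synset],
               paragraphs: List[Set[str]]) -> List[Optional[int]]:
    """Step 6: for each sense, pick the best synset, then its best paragraph."""
    mapping = []
    for definition in definitions:
        j = best([score_a(definition, s) for s in synsets])
        k = None if j is None else best(
            [score_b(synsets[j], p) for p in paragraphs])
        mapping.append(k)
    return mapping


# Invented toy data for a word with two senses.
definitions = [{"land", "along", "side", "river"},
               {"institution", "money", "kept"}]
synsets = [
    {"words": {"bank", "depository"}, "hypernyms": {"institution"},
     "coordinates": {"treasury"},
     "gloss": {"financial", "institution", "accepts", "deposits"}},
    {"words": {"bank", "riverside"}, "hypernyms": {"slope", "land"},
     "coordinates": {"shore"},
     "gloss": {"sloping", "land", "beside", "river"}},
]
paragraphs = [{"bank", "shore", "slope", "coast"},
              {"bank", "treasury", "institution", "fund"}]

print(map_senses(definitions, synsets, paragraphs))  # [0, 1]
```

The sketch returns paragraph indices; mapping a paragraph to the ROGET class that contains it is a lookup left out here.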
We have made an operational assumption about
the analysis of definitions. We did not attempt to
parse definitions to identify genus terms but simply
approximated this by using the weights $a_1$, $a_2$ and
$a_3$ in Step 3. Considering that words are often
defined in terms of superordinates and slightly less
often by synonyms, we assign numerical weights in the
order $a_2 > a_1 > a_3$. We are also aware that
definitions can take other forms which may involve
part-of relations, membership, and so on, though we
did not deal with them in this study.
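
As a worked illustration of this weighting, with hypothetical overlap counts: suppose a definition shares one word with synset $S_j$ and two words with its hypernyms but none with its gloss, while it shares two words with another synset $S_{j'}$ and three with that synset's gloss but none with its hypernyms. With the weights used in our tests, Step 3 gives

```latex
\begin{align*}
A(i,j)  &= 3 \cdot 1 + 5 \cdot 2 + 2 \cdot 0 = 13 \\
A(i,j') &= 3 \cdot 2 + 5 \cdot 0 + 2 \cdot 3 = 12
\end{align*}
```

so the synset matched through its hypernyms is preferred even though fewer words overlap in total.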
4 Testing and Results 
The algorithm was tested on 12 nouns, listed in Ta- 
ble 1 with the number of senses in the various lexical 
resources. 
The various types of possible mapping errors are 
summarised in Table 2. Incorrectly Mapped and 
Unmapped-a are both "misses", whereas Forced Er- 
ror and Unmapped-b are both "false alarms". 
The performance of the three parts of mapping
is shown in Table 3. The "carry-over error" is only
applicable to the last stage, L→R, and it refers to
cases where the final answer is wrong as a result of
a faulty outcome from the first stage (L→W).

Word      R   W   L     Word       R   W   L
Country   3   4   5     Matter     8   5   7
Water     9   8   8     System     6   8   5
School    3   6   7     Interest  14   8   6
Room      3   4   5     Voice      4   8   9
Money     1   3   2     State      7   5   6
Girl      4   5   5     Company   10   8   9
Table 1: The 12 nouns used in testing (R = ROGET, W = WN, L = LDOCE)

                 Mapping Outcome
Target Exists    Wrong Match           No Match
Yes              Incorrectly Mapped    Unmapped-a
No               Forced Error          Unmapped-b
Table 2: Different types of errors

                      L→W      W→R      L→R
Accurately Mapped     68.9%    75.0%    55.4%
Incorrectly Mapped    12.2%     1.4%     4.1%
Unmapped-a             2.7%     6.9%    13.5%
Unmapped-b            13.5%     5.6%    16.2%
Forced Error           2.7%    11.1%      -
Carry-over Error        -        -      10.8%
Table 3: Performance of the algorithm
5 Discussion 
Overall, the Accurately Mapped figures support our
hypothesis that conventional dictionaries and thesauri
can be related through WordNet. Looking at the
unsuccessful cases, we see that there are relatively
more "false alarms" than "misses", showing that errors
arise mostly from the inadequacy of individual
resources, where no target exists, rather than from
partial failures of the process. Moreover, the number
of "misses" can possibly be reduced if more definition
patterns are considered.
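
The comparison of "misses" and "false alarms" can be reproduced from Table 3 by adding up the relevant rows, using the grouping given with Table 2. The snippet is a small sketch of that arithmetic; the variable and function names are hypothetical, and the carry-over error is left out of both groups because the text does not assign it to either.

```python
# Percentages copied from Table 3; absent cells are simply omitted.
table3 = {
    "L->W": {"Accurately Mapped": 68.9, "Incorrectly Mapped": 12.2,
             "Unmapped-a": 2.7, "Unmapped-b": 13.5, "Forced Error": 2.7},
    "W->R": {"Accurately Mapped": 75.0, "Incorrectly Mapped": 1.4,
             "Unmapped-a": 6.9, "Unmapped-b": 5.6, "Forced Error": 11.1},
    "L->R": {"Accurately Mapped": 55.4, "Incorrectly Mapped": 4.1,
             "Unmapped-a": 13.5, "Unmapped-b": 16.2,
             "Carry-over Error": 10.8},
}

MISSES = ("Incorrectly Mapped", "Unmapped-a")      # a target exists
FALSE_ALARMS = ("Forced Error", "Unmapped-b")      # no target exists


def tally(stage: str) -> tuple:
    """Return (misses, false alarms) in percent for one mapping stage."""
    row = table3[stage]
    misses = round(sum(row.get(name, 0.0) for name in MISSES), 1)
    false_alarms = round(sum(row.get(name, 0.0) for name in FALSE_ALARMS), 1)
    return misses, false_alarms


for stage in table3:
    misses, false_alarms = tally(stage)
    print(f"{stage}: misses {misses}%, false alarms {false_alarms}%")
```

Under this grouping the tallies are 14.9% against 16.2% for L→W and 8.3% against 16.7% for W→R; for L→R, where no Forced Error figure is reported, they are 17.6% against 16.2%.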
Clearly the successful mappings are influenced by 
the fineness of the sense discrimination in the re- 
sources. How finely they are distinguished can be 
inferred from the similarity score matrices. Reading 
the matrices row-wise shows how vaguely a certain 
sense is defined, whereas reading them column-wise 
reveals how polysemous a word is. 
While the links resulting from the algorithm can 
be right or wrong, there were some senses of the 
test words which appeared in one resource but had 
no counterpart in the others, i.e. they were not
attached to any links. In all, 18.9% of the LDOCE
senses, 11.1% of the WN synsets and 58.1% of
the ROGET classes were among these unattached
senses. Although this implies that relying on a single
resource is insufficient for any application, it also
suggests there is additional information we can use
to overcome the inadequacy of individual resources.
For example, we may take the senses from one re- 
source and complement them with the unattached 
senses from the other two, thus resulting in a more 
complete but not redundant sense discrimination. 
6 Future Work 
This study can be extended in at least two directions.
One is to focus on the generality of the algorithm by
testing it on a wider variety of words, and the other
on its practical value by applying the resultant
lexical information in some real applications and
checking the effect of using multiple resources. It is
also desirable to explore definition parsing to see if
mapping results can be improved.

References 
R. Amsler. 1981. A taxonomy for English nouns and 
verbs. In Proceedings of ACL '81, pages 133-138. 
N. Calzolari. 1984. Detecting patterns in a lexical data
base. In Proceedings of COLING-84, pages 170-173.
N. Calzolari. 1988. The dictionary and the thesaurus 
can be combined. In M.W. Evens, editor, Relational 
Models of the Lexicon: Representing Knowledge in Se- 
mantic Networks. Cambridge University Press. 
M.S. Chodorow, R.J. Byrd, and G.E. Heidorn. 1985. 
Extracting semantic hierarchies from a large on-line 
dictionary. In Proceedings of ACL '85, pages 299-304. 
B. Kirkpatrick. 1987. Roget's Thesaurus of English
Words and Phrases. Penguin Books.
J. Klavans and E. Tzoukermann. 1995. Combining cor- 
pus and machine-readable dictionary data for building 
bilingual lexicons. Machine Translation, 10:185-218. 
J. Klavans, M. Chodorow, and N. Wacholder. 1990. 
From dictionary to knowledge base via taxonomy. In 
Proceedings of the Sixth Conference of the University 
of Waterloo, Canada. Centre for the New Oxford En- 
glish dictionary and Text Research: Electronic Text 
Research. 
J. Markowitz, T. Ahlswede, and M. Evens. 1986. Se- 
mantically significant patterns in dictionary defini- 
tions. In Proceedings of ACL '86, pages 112-119. 
G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and
K. Miller. 1993. Introduction to WordNet: An on-line
lexical database. Five Papers on WordNet.
P. Procter. 1978. Longman Dictionary of Contemporary 
English. Longman Group Ltd. 
P.M. Roget. 1852. Roget's Thesaurus of English Words
and Phrases. Penguin Books.
P. Vossen and A. Copestake. 1993. Untangling def- 
inition structure into knowledge representation. In 
T. Briscoe, A. Copestake, and V. de Paiva, editors, In- 
heritance, Defaults and the Lexicon. Cambridge Uni- 
versity Press. 
D. Yarowsky. 1992. Word-sense disambiguation using 
statistical models of Roget's categories trained on 
large corpora. In Proceedings of COLING-92, pages 
454-460, Nantes, France. 
