COLING 82, J. Horecký (ed.)
North-Holland Publishing Company
© Academia, 1982
APPROACHES TO THESAURUS PRODUCTION 
A. Michiels, J. Noël, English Department
University of Liege 
Place Cockerill, 3, 
B-4000 Liege 
Belgium 
We contrast two approaches to thesaurus production : the
traditional and intuitive one versus the Amsler-type
procedure, which interactively generates filiations among
the genus words in a computerized dictionary. We discuss
the application of such a procedure to our lexical data
base (LONGMAN DICTIONARY OF CONTEMPORARY ENGLISH).
I INTRODUCTION 
Since 1979 we have had available, by contract with Longman Ltd, the computer tape
of LDOCE (LONGMAN DICTIONARY OF CONTEMPORARY ENGLISH). Our main concern has been
the development of a syntactico-semantic analyzer of general English making full
use of all the formatted information contained in our dictionary file (Michiels
et al. 1980; Michiels 1982).
LDOCE is a medium-sized dictionary of core English containing some 60,000 entries
which feature the following types of information : 
a) fully formalized 
Part of speech (POS) 
Grammatical fields, i.e. sets of grammatical codes, which describe the
environment that the code-bearing item can or must fit in.
What makes these grammatical fields particularly suitable for the purposes of
machine disambiguation of natural language is that they are assigned to word- 
senses (definitions) as well as to whole lexical entries. An example is provided 
by the LDOCE entry CONSIDER (p. 233). 
In the example string
I consider you a fool
the two-NP chain (YOU = NP1, A FOOL = NP2) satisfies the [X1] code associated with the
second definition of the verb and enables the analyzer to select the appropriate
definition in context ("scanning procedures" : cf. Michiels et al. 1980).
Definition space, i.e. 
(i) semantic codes : inherent features for nouns, selectional 
restrictions for adjectives and verbs 
Consider the entry HAMMER, verb. As the definition space does not appear in the
printed version, we refer the reader to the computer file where, for the third
definition, the semantic codes indicate that both the deep subject and the deep
object must be H, i.e. [+HUMAN].
(ii) subject codes (field labels)
Ex : In the entry HAMMER, def. 3 is assigned SPXX (Sports) and def. 5 ECZS
(EC : Economics, Z : subdivision indicator, S : Stock Exchange and Investment).
b) partly formalized 
In most dictionaries, definitions are nothing but strings of
natural language, albeit of a special type (Smith and Maxwell 1973; Amsler 1980,
p. 108). A first step towards formalizing definitions has been taken by the LDOCE
lexicographers : all the LDOCE examples and definitions are written in a
controlled defining vocabulary of some 2,100 items (lexemes - e.g. HISTORY - and 
morphemes - e.g. RE- and -IZATION - no morphological variants). 
Our concern in this paper will be with how to produce thesauri from dictionary 
files. What prompts us to examine this problem is the existence of two contrasting 
approaches to thesaurus-production : the first is exemplified by LOLEX (LONGMAN
LEXICON OF CONTEMPORARY ENGLISH, 1981), the second by Amsler 1980.
II THESAURUS PRODUCTION 
Although LOLEX takes over a subset of the LDOCE definitions, both the choice of
thesauric categories (e.g. J212 verbs : DISMISSING AND RETIRING PEOPLE) and the
assignment of a lexical item to one of several categories (e.g. DISBAND assigned
to J212) are based on the lexicographer's intuition and knowledge of previous
work in the field (cf. Roget's, etc.).
Amsler's approach is totally different (see Amsler 1980) : using as data base the 
computer files of the MPD (Merriam Pocket Dictionary) prepared by John Olney
(Olney 1968), he develops an interactive procedure for thesaurus production. The
first step is a manual selection and disambiguation of the GENUS TERMS in the
definitions of nouns and verbs. By GENUS TERM is to be understood the first word
of the definition which has the same POS as the definiendum and can serve as its
superordinate. For example, in the first definition of HAMMER, the genus term is 
STRIKE, whereas in the fifth it is DECLARE. 
It should be realized that genus term and syntactic head do not always coincide,
and this mismatch is a major obstacle in the development of automatic procedures
for genus term selection. Contrast in this respect the first and the second
homographs of the LDOCE headword BOA (page 105). The second poses no problem :
syntactic head and genus term are identical (GARMENT). In the first, however, the
genus term is lodged inside the second OF-phrase, itself embedded in the first,
which in its turn depends on the syntactic head ANY.
Once they have been selected, the genus terms are disambiguated with reference to 
the data base itself by selecting the appropriate homograph and definition 
numbers. A convenient example, drawn from LDOCE, is the disambiguation of the
genus term CONSIDER in the definitions of LOOK ON ([X9 esp. as, with] : to
consider; regard). CONSIDER here will be disambiguated as CONSIDER (x, 2)
(x = non-homographic, 2 = second definition - cf. LDOCE entry CONSIDER, p. 233).
The next step is the use of a tree-growing algorithm, which Amsler has programmed
and applied to his MPD data base. It is based on a filiation technique between
lexical entries and genus terms. We shall illustrate it with respect to the item
VEHICLE (x, 1) in our own data base. Descending the filiation path, the procedure
will select all the items which use the word VEHICLE (x, 1) as genus term in their
definitions. Among these are CAR (x, 1/2/3) and CARRIAGE (x, 1/2/7). CARRIAGE in
turn functions as a genus term and yields its own sub-class, which contains, among
others, the items BROUGHAM (x, x = non-homographic + a single definition) and
GIG (1, 1), which are themselves defined by means of the genus term CARRIAGE. In
our example, the procedure stops at BROUGHAM and GIG because these lexical items
are nowhere in the dictionary used as genus terms. It results in a partial
taxonomy headed by the item VEHICLE :
LEVEL 1 : VEHICLE (x, 1)
LEVEL 2 : CAR (x, 1/2/3)   CARRIAGE (x, 1/2/7)
LEVEL 3 : BROUGHAM (x, x)   GIG (1, 1)
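The descending filiation step can be sketched in a few lines of Python. This is only an illustration of the idea, not Amsler's program : the table GENUS_OF, the function names and the simplification to one definition number per item are our own assumptions, and the genus links are taken to be already selected and disambiguated by hand.

```python
from collections import defaultdict

# Hypothetical, hand-disambiguated genus links: entry sense -> genus sense.
# A sense is (headword, homograph, definition); "x" stands for "non-homographic"
# or "single definition". Definition numbers are simplified to one per item.
GENUS_OF = {
    ("CAR", "x", "1"): ("VEHICLE", "x", "1"),
    ("CARRIAGE", "x", "1"): ("VEHICLE", "x", "1"),
    ("BROUGHAM", "x", "x"): ("CARRIAGE", "x", "1"),
    ("GIG", "1", "1"): ("CARRIAGE", "x", "1"),
}


def invert(genus_of):
    """Map each genus sense to the sorted list of senses defined by means of it."""
    children = defaultdict(list)
    for entry, genus in genus_of.items():
        children[genus].append(entry)
    return {genus: sorted(entries) for genus, entries in children.items()}


def grow_tree(root, children, level=1, seen=None):
    """Descend the filiation path from root, yielding (level, sense) pairs."""
    seen = set() if seen is None else seen
    if root in seen:  # guard against circular definitions
        return
    seen.add(root)
    yield level, root
    # The descent stops by itself at items never used as genus terms.
    for child in children.get(root, []):
        yield from grow_tree(child, children, level + 1, seen)


taxonomy = list(grow_tree(("VEHICLE", "x", "1"), invert(GENUS_OF)))
for level, (word, homograph, definition) in taxonomy:
    print("LEVEL", level, ":", word, f"({homograph}, {definition})")
```

On this toy table the walk visits VEHICLE at level 1, CAR and CARRIAGE at level 2, and BROUGHAM and GIG at level 3, where it stops.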
Going up the filiation path from the word-sense VEHICLE (x, 1) one finds as
syntactic head the pro-form SOMETHING - there is no genus term. Even if one is
prepared to consider SOMETHING as the genus term (relaxing the POS identity
condition), the thesauric link that is obtained does not yield more information 
than the semantic codes associated with the relevant definition. 
A clear advantage of Amsler's procedure over intuitive thesaurus-production (as
exemplified in LOLEX) is that it can lead to an improvement of the dictionary data
base that is used as source. To take only one example : suppose that one is
convinced that there should be a thesauric link (hyponym - superordinate) between
VEHICLE and INSTRUMENT. If LDOCE is used as source data base for
thesaurus-production, the link in question will not be retrieved (INSTRUMENT is not used as
genus term in the LDOCE definition of VEHICLE (x, 1)), which inevitably raises the
question of whether or not to revise the definition of VEHICLE.
III EXPLOITING LDOCE DEFINITIONS
Applied to the LDOCE definitions, Amsler's technique reveals an interesting
consequence of a controlled defining vocabulary : the thesauric hierarchies are
shallower in LDOCE than in MPD (which does not feature a controlled defining
vocabulary). To give an example, MPD defines LIMOUSINE by means of the genus term
SEDAN.
Level one : VEHICLE
Level two : AUTOMOBILE
Level three : SEDAN
Level four : LIMOUSINE : ...... sedan
SEDAN is not available as genus term in LDOCE because it is not in the defining
vocabulary. LIMOUSINE, defined by means of the genus term CAR, is level 3, not 4
in LDOCE :
Level one : VEHICLE 
Level two : CAR 
Level three : LIMOUSINE : ...... car 
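The difference in depth can be checked mechanically by climbing the genus chain of an item. The sketch below is ours and purely illustrative : the two tables only restate the LIMOUSINE chains given above, sense numbers are left out, and the names MPD_GENUS, LDOCE_GENUS and level are assumptions.

```python
# Hypothetical genus links (headword -> genus headword) restating the two
# LIMOUSINE chains quoted in the text; sense numbers are omitted for brevity.
MPD_GENUS = {"LIMOUSINE": "SEDAN", "SEDAN": "AUTOMOBILE", "AUTOMOBILE": "VEHICLE"}
LDOCE_GENUS = {"LIMOUSINE": "CAR", "CAR": "VEHICLE"}


def level(word, genus_of):
    """Return the 1-based level of word, counting up the genus chain to its top."""
    depth, seen = 1, {word}
    while word in genus_of:
        word = genus_of[word]
        if word in seen:  # circular definition: stop rather than loop forever
            break
        seen.add(word)
        depth += 1
    return depth


print(level("LIMOUSINE", MPD_GENUS))    # 4
print(level("LIMOUSINE", LDOCE_GENUS))  # 3
```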
The shallow hierarchies based on LDOCE definitions are no doubt less revealing for 
the purpose of thesauric organisation. But the use of a controlled defining 
vocabulary makes it easier to process dictionary definitions in terms of both :
1) automatizing genus term selection and disambiguation and
2) parsing whole definition strings (as opposed to 1).
This is because the lexicon that the parser must have access to can be determined
in advance. It is NOT open-ended (open-ended means, practically, as extensive as
the defined vocabulary, i.e. the whole list of dictionary entries - cf. Amsler
1980, p. 109).
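What a closed lexicon buys the parser can be illustrated with a short Python sketch. It is an assumption-laden toy : DEFINING_VOCABULARY is a made-up five-word fragment, not the actual LDOCE list, and the function out_of_vocabulary does no morphological analysis at all.

```python
import re

# Hypothetical fragment of a controlled defining vocabulary; the real LDOCE
# list has some 2,100 items and is not reproduced here.
DEFINING_VOCABULARY = {"a", "kind", "of", "vehicle", "car"}


def out_of_vocabulary(definition, vocabulary):
    """Return the words of a definition that the closed lexicon does not cover."""
    words = re.findall(r"[a-z]+", definition.lower())
    return [word for word in words if word not in vocabulary]


print(out_of_vocabulary("a kind of vehicle", DEFINING_VOCABULARY))  # []
print(out_of_vocabulary("a kind of sedan", DEFINING_VOCABULARY))    # ['sedan']
```

With a controlled vocabulary the second list is expected to stay empty for every definition, so the parser's lexicon can be fixed before any definition is read.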
Schematically, the decision to use a controlled vocabulary to write dictionary 
definitions can have three undesirable consequences : 
1) .- reduction of the amount of information conveyed by the definition : OVERUSE
of implicitly or explicitly partial definitions (in the sense of Bierwisch &
Kiefer 1969, p. 66-68) - the latter are incomplete definitions which wear
their incompleteness on their sleeve, for example :
TARANTULA : spider of a certain kind.
2) .- semantic overloading of all-purpose items such as GET, HAVE, MAKE, TAKE, etc.
E.g. KEEP (1, 8) : to have for some time or for more time (LDOCE, p. 605)
3) .- uncontrolled increase in syntactic complexity in the differentia (non-genus
part of the definition) :
a) degree of embedding - not only in clauses, but also - and perhaps more
importantly - in complex nominal groups (cf. Amsler 1980, p. 108 on
ANT-EATING in the definition of AARDVARK)
b) anaphoric relations
c) scope relations (conjunction plays a prominent part here)
Compare the following two definitions of INSULIN :
i) .- OALDOCE (Hornby 1980) - 18 words
substance (a hormone) prepared from the pancreas* of sheep used in the
medical treatment of sufferers from diabetes*
(* = does not belong to the LDOCE defining vocabulary).
ii) .- LDOCE - 37 words 
a substance produced naturally in the body which allows sugar to be used for 
ENERGY, esp. such a substance taken from sheep to be given to sufferers from
a disease (DIABETES) which makes them lack this substance.
(ENERGY and DIABETES in capital letters because not in LDOCE defining
vocabulary). 
This third consequence stems from the avoidance of non-defining vocabulary items 
by means of PARAPHRASE, which displaces the burden towards syntactic elaboration,
a point cogently made in Ralph 1980 (p. 117). 
This "grammaticalization" of much of the information conveyed by LDOCE dictionary 
definitions points to the need to analyse whole definition strings rather than 
just the genus terms (see the process of ANNOTATING dictionary definitions in Noël
et al. 1981). 
Before we consider how to tackle the problem of disambiguating definition strings, 
we must examine a much easier way of retrieving at least some thesauric links from 
the LDOCE dictionary file. The LDOCE lexicographers sometimes provide ready-made 
thesauric links : 
1) .- cross-reference to an item belonging to the defining vocabulary :
CAPTAIN (2, x) : to be captain of; command
(synonyms)
2) .- cross-reference to a non-defining vocabulary item :
ABBEY (x, 1) : ...... ; MONASTERY or CONVENT
(synonyms)
3) .- cross-reference to a non-defining vocabulary item inside an LDOCE definition,
with a paraphrase in the defining vocabulary. An example is to be found in the
LDOCE definition of INSULIN quoted above :
disease (DIABETES) which ....
(DIABETES = hyponym; disease = genus term, superordinate)
In Noël et al. 1981 and Michiels et al. 1981 we have shown the power of the LDOCE
grammatical codes to disambiguate items in context, more specifically in the
context provided by the definition strings themselves. For instance, in the LDOCE definition
- a wicked person who leads people to do wrong or harms those who are kind to
him
the annotating process will select the V3 code for LEADS, because it occurs in
the syntactic environment NP + TO + VP (NP = people, VP = do wrong) defined by
V3. This assignment enables the system to reject all the word senses for LEAD
in LDOCE except the appropriate one (one out of nine; cf. entry LEAD, page 622).
We would like here to put forward a further possible exploitation of the LDOCE
grammatical codes for the purpose of disambiguating dictionary definitions. It
applies to genus terms and consists in the selection of a preferred word-sense
for the genus term on the basis of a similarity in grammatical code between
definiendum and genus term. Let us turn back to our fourth example, the entry
LOOK ON (2, x). The first genus term is CONSIDER. LOOK ON is assigned the
grammatical code X9. The second definition of CONSIDER is assigned the
X (to be) 1, 7 code. The similarity in grammatical code X serves as criterion
to disambiguate CONSIDER in the definition of LOOK ON as CONSIDER (x, 2).
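A minimal Python sketch of this selection criterion follows. It is a proposal-level illustration only : the X9 code of LOOK ON and the X (to be) 1, 7 code of CONSIDER def. 2 come from the text, whereas the codes listed for the other definitions of CONSIDER are invented placeholders, and the names CONSIDER_SENSES, code_letter and preferred_senses are our own.

```python
import re

# Hypothetical code assignments. Only the code of CONSIDER (x, 2) is taken from
# the text; the codes given for definitions 1 and 3 are invented placeholders.
CONSIDER_SENSES = {
    ("x", 1): ["T1", "I0"],
    ("x", 2): ["X (to be) 1, 7"],
    ("x", 3): ["T1"],
}


def code_letter(code):
    """Return the leading capital letter of a grammatical code, e.g. 'X' for 'X9'."""
    match = re.match(r"[A-Z]", code.strip())
    return match.group(0) if match else None


def preferred_senses(definiendum_codes, genus_senses):
    """Keep the genus-term senses sharing a code letter with the definiendum."""
    wanted = {code_letter(code) for code in definiendum_codes} - {None}
    return [
        sense
        for sense, codes in genus_senses.items()
        if wanted & {code_letter(code) for code in codes}
    ]


# LOOK ON carries the code X9, so only the X-coded sense of CONSIDER survives.
print(preferred_senses(["X9"], CONSIDER_SENSES))  # [('x', 2)]
```

Comparing code letters only is the coarsest possible notion of similarity; a finer comparison of the figures following the letter would be a natural refinement, which we do not attempt here.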
The LDOCE semantic and subject codes can be exploited in a similar way. It can be 
hypothesized that the combined use of all the formalized information types in 
LDOCE will prove to have a high disambiguating power and turn out to be a useful 
tool for the setting up of thesauric classes. 
A last point that we wish to touch on concerns the nature of the genus terms in a 
dictionary data base which makes use of a controlled defining vocabulary. The 
grammaticalization of information due to paraphrase in LDOCE gives rise to a
special distribution of genus terms along a FULL WORD - PROFORM gradient.
FULL WORD : LIQUID, SUBSTANCE, ANALYSIS (hyponym - superordinate)
PROFORM : SOMETHING, ANYTHING (cf. LDOCE def. of VEHICLE (x, 1)), PROCESS, ACTION
As compared with MPD, for example, LDOCE genus terms tend to cluster toward the 
proform end of the gradient. When the point is reached where the genus term does
not provide more specific information than the semantic codes assigned to the
definiendum, two conclusions can be drawn :
1) .- the lexicographers of the source dictionary must consider whether their
definition is appropriate, as it does not show the thesauric links 
perspicuously; 
2) .- the whole definition string must be processed and disambiguated, so as to 
retrieve the information that a dictionary which does not use a controlled 
defining vocabulary would have included in the genus term. 
At the same time, the analysis of whole definition strings will reveal a number 
of thesauric links (such as that between INSTRUMENT and ACTION discussed in
Michiels et al. 1980) that the study of genus terms, limited to the
HYPONYM-SUPERORDINATE relation, is unable to retrieve.
REFERENCES

OALDOCE = Hornby, A.S. (editor-in-chief), OXFORD ADVANCED LEARNER'S DICTIONARY
OF CURRENT ENGLISH, OUP, London, 1980

LDOCE = LONGMAN DICTIONARY OF CONTEMPORARY ENGLISH, editor-in-chief :
P. Procter, 1978

LOLEX = LONGMAN LEXICON OF CONTEMPORARY ENGLISH, Tom McArthur, 1981

Roget's = Roget's THESAURUS OF ENGLISH WORDS AND PHRASES, Penguin ed., 1966

Amsler 1980 = Amsler, R.A., THE STRUCTURE OF THE MERRIAM-WEBSTER
POCKET DICTIONARY, TR-164, University of Texas at Austin,
Ph D., Dec. 1980

Bierwisch and Kiefer 1969 = Bierwisch, M. and Kiefer, F., Remarks on Definitions 
in Natural Language, in Kiefer, F. (ed), STUDIES IN
SYNTAX AND SEMANTICS, D. Reidel, Dordrecht, Holland,
1969 

Michiels 1982 = Michiels, A., EXPLOITING A LARGE DICTIONARY DATA BASE,
Ph D thesis, University of Liege, 1982 (mimeographed) 

Michiels et al. 1980 = Michiels, A., Mullenders, J., Noël, J., Exploiting a
large data base by Longman, in COLING 80, 1980,
p. 573-582 

Michiels et al. 1981 = Michiels, A., Noël, J., Hayward, T., LE PROJET
LONGMAN-LIEGE, DEVELOPPEMENTS THESAURIQUES, Congrès du LASLA,
Liège, Novembre 1981

Noël et al. 1981 = Noël, J., Michiels, A., Mullenders, J., LE PROJET
LONGMAN-LIEGE, Congrès sur la lexicographie à l'âge électronique,
Luxembourg, 1981 

Olney 1968 = Olney, J., To all interested in the Merriam-Webster 
transcripts and data derived from them. Systems 
Development Corporation Document L-13579

Ralph 1980 = Ralph, B., Relative Semantic Complexity in Lexical
Units, in COLING 80, 1980, p. 115-121

Smith and Maxwell 1973 = Smith, R. and Maxwell, E., An English dictionary for
computerized syntactic and semantic processing,
International Conference on Computational Linguistics, 
Pisa, 1973. 
